FROM WEAK LEARNING TO STRONG
LEARNING IN FICTITIOUS PLAY 
TYPE ALGORITHMS 

BRIAN SWENSON, SOUMMYA KAR, AND JOAO XAVIER

Abstract. The paper studies the highly prototypical Fictitious Play (FP) algorithm, as well 
as a broad class of learning processes based on best-response dynamics, that we refer to as FP-type 
algorithms. A well-known shortcoming of FP is that, while players may learn an equilibrium strategy 
in some abstract sense, there are no guarantees that the period-by-period strategies generated by 
the algorithm actually converge to equilibrium themselves. This issue is fundamentally related to 
the discontinuous nature of the best response correspondence and is inherited by many FP-type 
algorithms. Not only does it cause problems in the interpretation of such algorithms as a mechanism 
for economic and social learning, but it also greatly diminishes the practical value of these algorithms 
for use in distributed control. We refer to forms of learning in which players learn equilibria in some 
abstract sense only (to be defined more precisely in the paper) as weak learning, and we refer to forms 
of learning where players’ period-by-period strategies converge to equilibrium as strong learning. An 
approach is presented for modifying an FP-type algorithm that achieves weak learning in order to 
construct a variant that achieves strong learning. Theoretical convergence results are proved. 
1. Introduction. Fictitious Play (FP), introduced in [1], is one of the oldest
and best-known game theoretic learning algorithms. FP has been shown to be an 
effective algorithm for distributed learning of Nash equilibria in various classes of 
games including two-player zero-sum games [2], generic 2 × m games [3], supermodular
games [4,5], one-against-all games [6], and potential games [7,8]. However, the
manner in which players learn in FP is often unsatisfactory, especially in the context 
of distributed control. 

In FP, players learn equilibrium strategies in the sense that the time-averaged 
empirical distribution of players’ actions converges to the set of Nash equilibria — 
a form of learning known as convergence in empirical distribution. This notion of 
learning tends to be problematic when the limit set of a learning algorithm contains 
mixed-strategy equilibria. In particular, convergence of the time-averaged empirical 
distribution to a mixed-strategy equilibrium does not imply any form of convergence 
in players’ period-by-period strategies or actions. In practice, players’ period-by- 
period strategies tend to move in progressively longer and longer cycles around an 
equilibrium set—the time-averaged empirical distribution is driven to equilibrium, 
but the period-by-period strategies never approach the equilibrium set themselves. 

In the context of repeated-play algorithms, we refer to convergence of the empirical
distribution (or some function thereof) to an equilibrium set as weak convergence,
and we refer to any form of learning involving weak convergence as weak learning. We
refer to the convergence of players’ period-by-period strategies to an equilibrium set
as strong convergence, and we refer to any form of learning involving strong convergence
as strong learning. Intuitively speaking, weak learning means that players learn
an equilibrium strategy in some abstract sense (i.e., convergence in empirical distribution)
but may never actually implement the strategy they are learning. In strong
learning, not only do players learn an equilibrium strategy, but they also implement
it.

FP is proven to achieve learning only in the weak sense, and thus no guarantees 
can be made regarding the convergence or optimality of players’ period-by-period
strategies. For example, Jordan [9] presents a continuum of games for which FP 
achieves weak learning, yet in all but a countable subset of games, the period-by- 
period strategies produced by FP never approach the game’s unique equilibrium. As 
another example, Young [10] presents a 2 × 2 game in which FP achieves weak learning,
but the period-by-period actions produced by FP achieve the lowest possible utility 
in every stage of the repeated play (see also Section 3.2). 

Our first main contribution is the presentation of a simple variant of FP that 
converges strongly to equilibrium. In our strongly convergent variant of FP, players
gradually and independently transition from using the FP best response rule to
determine the next-iteration action, to using their current empirical distribution as 
a probability mass function from which they sample to determine the next-iteration 
action. We show that, for any game in which FP can be shown to converge weakly 
to equilibrium (and for which a certain robustness assumption holds—see A.8), our 
variant of FP will converge strongly to equilibrium. 

One advantage of this approach is that it is readily applicable to more general 
FP-type learning algorithms. Our second (and more general) main contribution is a 
method for taking a weakly convergent FP-type learning algorithm, and constructing 
from it, a strongly convergent variant. We study a general class of FP-type algorithms 
and show that, so long as an algorithm achieves weak learning in a sufficiently robust 
sense (see A.8), then a strongly convergent variant of the algorithm can be constructed.
As an example of how the general result may be applied, we consider three
weakly convergent FP-type algorithms—classical FP, Generalized Weakened FP [11], 
and Empirical Centroid FP [12,13]—and construct the strongly convergent variant of 
each. 

1.1. Related Work. An overview of the topic of learning in games can be 
found in [10,14]. Various problems associated with learning mixed-strategy equilibria
in best-response-type learning algorithms (including FP-type algorithms) are
discussed in [9]. In particular, the issue of weak convergence is considered, along with 
a discussion of some of the underlying mechanics that lead to weak convergence. 

Many learning algorithms are designed to ensure that their limit points are pure- 
strategy equilibria [15-19]. Ensuring convergence to a pure strategy is a natural way 
of ensuring strong learning, since weak learning can generally only occur when the 
limit set contains mixed strategies. 

In contrast, this paper studies a method of ensuring strong convergence when the 
limit set of the algorithm contains mixed strategies. The ability to (strongly) learn 
mixed equilibria is important for many reasons, the foremost being that, in finite 
games, the set of Nash equilibria (NE) is only guaranteed to be non-empty if mixed 
equilibria are considered. Mixed strategies play an important role when the learned 
strategy needs to be robust to uncertainty in opponent behavior or game structure, 
or secure against the actions of malicious players [6,20-23]. With regard to FP in
particular, it was recently shown in [24] that, for the class of near-potential games,
the limit set of the FP dynamics (weakly speaking) is a neighborhood of a mixed
equilibrium.

Regret-testing algorithms [25], [26] achieve strong convergence to mixed-strategy 
equilibria in generic finite games. However, such algorithms operate on fundamentally 
different principles from FP-type algorithms—players implement a form of exhaustive 
search to coordinate on a NE strategy. Such algorithms tend to have slow convergence 
rates, especially when the number of players or available actions is large. 

Stochastic FP (SFP)—introduced in [27]—was proposed as a learning mechanism 
that could (i) mitigate the problem of weak convergence to mixed equilibria in FP and 
(ii) provide a reasonable explanation for why real-world players might learn mixed- 
strategy equilibria. In SFP, the issue of weak convergence is addressed by smoothing 
each player’s best response correspondence with the addition of small random shocks 
or perturbations. The stable points of SFP are not Nash equilibria, but rather Nash 
distributions. The set of Nash distributions converges to the set of Nash equilibria as 
the size of the perturbations goes to zero [27]. SFP has been shown to obtain strong 
convergence to the set of Nash distributions in various classes of games [8,14,28]. 
Moreover, if the perturbations are permitted to gradually decay throughout the course 
of the repeated play, then SFP converges to the set of NE [11]. 

In contrast to SFP, the present work does not consider the descriptive agenda
of providing an explanation for why real-world learners might act according to a
given behavior rule. Furthermore, we present a simple and intuitive procedure for
modifying a variety of weakly convergent learning algorithms in order to obtain a
strongly convergent variant. From a technical perspective, the current work differs
from SFP in that the best response correspondence is not directly smoothed in any
way.

The work [11] by Leslie et al. studies a useful generalization of FP termed Generalized
Weakened FP (GWFP). Among other contributions, the paper demonstrates
that the convergence of FP is not affected by asymptotically decaying perturbations 
to players’ best response sets. This result provides a cornerstone for our proofs by 
ensuring that FP (and GWFP) meet the critical robustness assumption A.8. We 
study a strongly convergent variant of GWFP in Section 6.2. Furthermore, [11] also 
presents a payoff-based, actor-critic learning algorithm based on GWFP that achieves 
strong learning. Our work differs from this in that we provide a general method 
for constructing a strongly convergent algorithm from a weakly convergent one in a 
setting where instantaneous payoff information may or may not be available.

Our preliminary results on strong convergence in FP are found in [29]. The present
work expands on [29] by considering algorithms beyond classical FP and establishing 
more general conditions under which convergence can be attained (in particular, see 
A.1-A.3). Furthermore, [29] contains a gap in reasoning in the proof of Lemma 2 
which the present paper fills in. 

The remainder of the paper is organized as follows. Section 2 sets up notation 
to be used in the subsequent development. Section 3 introduces classical FP and 
discusses the problem of weak convergence in classical FP. Section 4 presents the 
strongly convergent variant of classical FP and states the strong convergence theorem 
for classical FP. Section 5 presents the general notion of an FP-type algorithm, then 
presents the strongly convergent variant of an FP-type algorithm, states the general 
strong convergence result in the context of an FP-type algorithm, and presents the 
proof of the result. In Section 6, the general result is applied to prove strong convergence
in classical FP, Generalized Weakened FP, and Empirical Centroid FP. Section
7 concludes the paper.

2. Preliminaries. 

2.1. Setup and Notation. A game in normal form is represented by the triple
$\Gamma := (N, (Y_i, u_i)_{i \in N})$, where $N = \{1, \ldots, n\}$ denotes the set of players, $Y_i$ denotes the
finite set of actions available to player $i$, and $u_i : \prod_{i \in N} Y_i \to \mathbb{R}$ denotes the utility
function of player $i$. Denote by $Y := \prod_{i \in N} Y_i$ the joint action space.

In order to guarantee the existence of Nash equilibria it is necessary to consider the
mixed extension of $\Gamma$ in which players are permitted to play probabilistic strategies.
Let $m_i := |Y_i|$ be the cardinality of the action space of player $i$, and let
$\Delta_i := \{p \in \mathbb{R}^{m_i} : \sum_{k=1}^{m_i} p(k) = 1,\ p(k) \ge 0\ \forall k\}$
denote the set of mixed strategies available to
player $i$; note that a mixed strategy is a probability distribution over the action space
of player $i$. Denote by $\Delta^n := \prod_{i \in N} \Delta_i$ the set of joint mixed strategies.

In this context, we often wish to retain the notion of playing a deterministic
action. For this purpose, let $A_i := \{e_1, \ldots, e_{m_i}\}$ denote the set of “pure strategies” of
player $i$, where $e_j$ is the $j$-th canonical vector containing a 1 at position $j$ and zeros
otherwise.

The mixed utility function of player $i$ is given by
$U_i(p) := \sum_{y \in Y} u_i(y)\, p_1(y_1) \cdots p_n(y_n)$,
where $U_i : \Delta^n \to \mathbb{R}$. When convenient we sometimes write $U_i(p)$ as $U_i(p_i, p_{-i})$, where
$p_i$ denotes the mixed strategy of player $i$ and $p_{-i}$ denotes the mixed strategies of all
other players. The set of Nash equilibria is given by
$NE := \{p \in \Delta^n : U_i(p_i, p_{-i}) \ge U_i(p_i', p_{-i}),\ \forall p_i' \in \Delta_i,\ \forall i \in N\}$. Let

$BR_i^{\epsilon}(p_{-i}) := \{a_i \in A_i : U_i(a_i, p_{-i}) \ge \max_{\alpha_i \in A_i} U_i(\alpha_i, p_{-i}) - \epsilon\}$   (2.1)

be the $i$-th player’s set of $\epsilon$-best responses to a strategy profile $p_{-i}$ adopted by the
other players. Note that in this definition we only consider pure-strategy $\epsilon$-best responses.
Denote by $v_i(p_{-i}) := \max_{p_i \in \Delta_i} U_i(p_i, p_{-i})$ the value obtained by playing a
best response.
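
As an illustration of definition (2.1), the following is a minimal Python sketch of the $\epsilon$-best response set. It assumes a two-player game in which a player's utilities are stored as a matrix indexed by [own action][opponent action]; the function name `eps_best_responses` and the matching-pennies payoffs are illustrative choices, not taken from the paper.

```python
from typing import List, Sequence


def eps_best_responses(payoff: Sequence[Sequence[float]],
                       opponent_mix: Sequence[float],
                       eps: float = 0.0) -> List[int]:
    """Return the pure actions whose expected payoff is within eps of the best.

    payoff[a][b] is this player's utility for own action a against opponent
    action b (a two-player simplification of U_i(a_i, p_{-i})).
    """
    expected = [sum(u * p for u, p in zip(row, opponent_mix)) for row in payoff]
    best = max(expected)
    return [a for a, value in enumerate(expected) if value >= best - eps]


# Hypothetical example: the "matching" player in matching pennies.
matcher = [[1.0, -1.0], [-1.0, 1.0]]
exact = eps_best_responses(matcher, [0.6, 0.4])             # eps = 0
relaxed = eps_best_responses(matcher, [0.6, 0.4], eps=0.5)  # tolerance 0.5
print(exact, relaxed)  # prints [0] [0, 1]
```

With $\epsilon = 0$ only the strictly better action survives; enlarging $\epsilon$ admits the second action as well, which is the kind of perturbation the robustness assumption A.8 is meant to tolerate.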

Throughout, we assume there exists a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ rich enough to
carry out the construction of the various random variables required in this paper. For
a random object $X$ defined on a measurable space $(\Omega, \mathcal{F})$, let $\sigma(X)$ denote the
$\sigma$-algebra generated by $X$ [30]. As a matter of convention, all equalities and inequalities
involving random objects are to be interpreted almost surely (a.s.) with respect to 
the underlying probability measure, unless otherwise stated. 

2.2. Repeated Play. Suppose players repeatedly face off in the game $\Gamma$. Denote
by $t \in \{1, 2, \ldots\}$ a round of the repeated play. Let $\{a_i(t)\}_{t \ge 1}$ denote the sequence of
actions taken by player $i$, where $a_i(t) \in A_i$, and let $\{a(t)\}_{t \ge 1}$, $a(t) = (a_1(t), \ldots, a_n(t))$,
denote the sequence of joint actions.

Let $\{\mathcal{F}_t\}_{t \ge 1}$ be a filtration (sequence of $\sigma$-algebras) that contains the information
available to players in round $t$ of the repeated play. For $t \ge 1$ and $\alpha_i \in A_i$,
let $g_i(\alpha_i, t) \in \mathbb{R}$ be an $\mathcal{F}_{t-1}$-measurable random variable with
$g_i(\alpha_i, t) := \mathbb{P}(a_i(t) = \alpha_i \mid \mathcal{F}_{t-1})$, and let $g_i(t) \in \Delta_i$ be the vector with components
$g_i(t) := (g_i(\alpha_1, t), \ldots, g_i(\alpha_{m_i}, t))$, where $m_i$ is the cardinality of $A_i$. We say $g_i(t)$ is the
mixed strategy used by player $i$ in round $t$, and we say $\{g_i(t)\}_{t \ge 1}$ is the sequence of
period-by-period (mixed) strategies used by player $i$. The sequence of joint period-
by-period strategies is given by $\{g(t)\}_{t \ge 1}$, $g(t) := (g_1(t), \ldots, g_n(t))$.

Denote by $q_i(t) \in \Delta_i$ the empirical distribution of player $i$. The precise manner
in which the empirical distribution^1 is formed will depend on the algorithm at hand.

In general, $q_i(t)$ is formed as a function of the action history $\{a_i(s)\}_{s=1}^{t}$ and serves
as a compact representation of the action history of player $i$ up to and including the
round $t$. The joint empirical distribution is given by $q(t) := (q_1(t), \ldots, q_n(t))$.

Unless otherwise stated, $d(\cdot, \cdot)$ denotes the distance induced by the standard Euclidean norm.
For $m \ge 1$, define the distance from $p \in \mathbb{R}^m$ to $S \subset \mathbb{R}^m$ by
$d(p, S) := \inf\{d(p, p') : p' \in S\}$. We say a repeated-play learning process converges weakly to equilibrium if for
some map $f : \Delta^n \to \Delta^n$ there holds $d(f(q(t)), NE) \to 0$ as $t \to \infty$. In most
cases in this paper, $f$ will simply be the identity function. We say a repeated-play
learning process converges strongly^2 to equilibrium if $d(g(t), NE) \to 0$ as $t \to \infty$.
Note that weak learning implies that players learn an equilibrium strategy, but may 
never actually begin to implement the strategy that is being learned. On the other 
hand, in strong learning players both learn an equilibrium strategy, and implement 
the strategy that is being learned (see Section 3.2 for more details). 

3. Fictitious Play. 

3.1. Fictitious Play. Let

$q_i(t) := \frac{1}{t} \sum_{s=1}^{t} a_i(s)$   (3.1)

be the normalized histogram^3 of the actions of player $i$.

FP may be intuitively understood as follows. Players repeatedly face off in a stage 
game $\Gamma$. In any given stage of the game, players choose a next-stage action by assuming
(perhaps incorrectly) that opponents are using stationary and independent strategies. 
Thus, in FP, players use the marginal empirical distribution of each opponent’s past 
play, qi(t), as a prediction of the opponent’s behavior in the upcoming round and 
choose a next-round strategy which is a best response against this prediction. 

A sequence of actions $\{a(t)\}_{t \ge 1}$ such that^4

$a_i(t+1) \in BR_i(q_{-i}(t)), \quad \forall i,$   (3.2)

for all $t \ge 1$, is referred to as a fictitious play process. FP has been studied extensively
to determine the classes of games for which it can be said to converge (weakly) to 
the set of Nash equilibria. Among other results, it has been shown that FP leads 
to weak learning in two-player zero-sum games [2], potential games [7], and generic 
$2 \times m$ games [3]. We summarize these results in the following theorem.

Theorem 3.1. Let $\Gamma = (N, (Y_i, u_i)_{i \in N})$ be a two-player zero-sum game,
potential game, or generic $2 \times m$ game, and let $\{a(t)\}_{t \ge 1}$ be a fictitious play process
on $\Gamma$. Then $d(q(t), NE) \to 0$ as $t \to \infty$.

3.2. Weak Convergence in Fictitious Play. The following example (see [10],
p. 78), while fairly simple, clearly illustrates the phenomenon of weak convergence in
FP, and demonstrates why weak convergence can be a deeply unsatisfactory notion
of learning.

^1 The term empirical distribution is often used to refer explicitly to the time-averaged histogram
of the action choices of some player $i$; i.e., $q_i(t) = \frac{1}{t} \sum_{s=1}^{t} a_i(s)$. Here, we allow for a broader
definition that will permit interesting and useful algorithmic generalizations.

^2 The notion of strong convergence presented in this paper is comparable to the notions of “convergence
in intended behavior” presented in [27] and “convergence in strategic intentions” given
in [10].

^3 Recall that the actions $a_i(t) \in A_i$ are Dirac distributions in the mixed-strategy space $\Delta_i$.

^4 In all variants of FP discussed in this paper, the initial action $a_i(1)$ may be chosen arbitrarily
for all $i$.

Consider the two-player asymmetric coordination 
game shown in Figure 3.1. The game has three Nash 
equilibria: both players play A, both players play B, 
and an asymmetric mixed-strategy Nash equilibrium. 
The game is a potential game [7] (in fact, an identical
interests game [31]) and hence falls within the
purview of Theorem 3.1: regardless of the initial conditions,
players engaged in an FP process will learn an
equilibrium in the weak sense that $d(q(t), NE) \to 0$
as $t \to \infty$.

Suppose that the players are engaged in an FP 
process on this game, and in the first round they miscoordinate
their actions (e.g., one chooses A, and the other chooses B). Young [10]
shows the somewhat counterintuitive result that the FP dynamics will in fact lead
players to miscoordinate their action choices in every subsequent round of the learning
process. Thus, despite the fact that $\lim_{t \to \infty} d(q(t), NE) = 0$, the players’ realized
action choices are extremely suboptimal, yielding the lowest possible utility in each
round of play. Intuitively speaking, this phenomenon occurs when players’ actions
cycle in such a way as to drive the time-averaged empirical distribution to a mixed-
strategy Nash equilibrium, yet players’ period-by-period strategies never constitute
(nor even approach) a Nash equilibrium themselves.
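
The miscoordination phenomenon can be reproduced numerically. The following Python sketch runs classical FP (3.2) on the asymmetric coordination game, with the payoffs read off Figure 3.1 as ($\sqrt{2}$, 1) for (A, A), (1, $\sqrt{2}$) for (B, B), and (0, 0) otherwise. The function names, the tie-breaking rule (toward A), and the horizon of 20000 rounds are assumptions made for illustration only.

```python
import math

SQRT2 = math.sqrt(2.0)
# Payoffs read off Figure 3.1, indexed [own action][opponent action];
# action 0 = A, action 1 = B.
U = [
    [[SQRT2, 0.0], [0.0, 1.0]],   # player 0 (row player)
    [[1.0, 0.0], [0.0, SQRT2]],   # player 1 (column player)
]


def best_response(payoff, opp_counts):
    """Pure best response to the opponent's (unnormalized) action counts."""
    values = [sum(u * c for u, c in zip(row, opp_counts)) for row in payoff]
    return 0 if values[0] >= values[1] else 1


def simulate_fp(rounds, first_actions=(0, 1)):
    counts = [[0, 0], [0, 0]]        # counts[i][a]: times player i played a
    actions = list(first_actions)
    miscoordinated = 0
    for _ in range(rounds):
        counts[0][actions[0]] += 1
        counts[1][actions[1]] += 1
        miscoordinated += actions[0] != actions[1]
        # (3.2): each player best-responds to the other's empirical distribution.
        actions = [best_response(U[0], counts[1]), best_response(U[1], counts[0])]
    q = [[c / rounds for c in row] for row in counts]
    return q, miscoordinated


q, miscoordinated = simulate_fp(20000)
mixed_ne_p0_A = SQRT2 / (1.0 + SQRT2)   # player 0's equilibrium weight on A
print(miscoordinated, q[0][0], mixed_ne_p0_A)
```

Starting from a miscoordinated first round, every round in this run is miscoordinated (so each realized payoff is zero), while the empirical frequencies approach the mixed equilibrium. Working with raw counts rather than normalized frequencies avoids rounding in the best-response comparison.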

It may be said that in weak learning players “learn” a NE strategy in some 
abstract sense, but never actually implement the strategy they are learning. In strong 
learning, players not only learn a NE strategy, but they also physically implement the 
strategy that is being learned. 

The following section presents a simple modification of FP that achieves strong
learning; i.e., players’ period-by-period strategies converge to equilibrium in addition 
to convergence of the empirical distributions. 

4. Strong Convergence in Classical Fictitious Play. Consider a variant of
FP in which the action for player $i$ at time $t$ is chosen by drawing a random sample
from the mixed strategy (i.e., probability distribution) $g_i(t)$, where

$g_i(t) \in BR_i(q_{-i}(t-1))\,\rho_i(t) + q_i(t-1)\,(1 - \rho_i(t))$,   (4.1)

$\rho_i(t) \in [0, 1]$, and $\lim_{t \to \infty} \rho_i(t) = 0$. Intuitively, this is similar to the classical FP
process (3.2), but rather than playing a deliberate best response each round, players
gradually transition toward drawing their stage $t$ action as a random sample from
their own empirical distribution, $q_i(t)$.

The idea is that players will play a best response sufficiently often so that, per
FP, the empirical distribution $q(t)$ will be driven toward equilibrium, as in Theorem
3.1. Then, since $\rho_i(t) \to 0$ as $t \to \infty$, the mixed strategy $g_i(t)$ tends towards $q_i(t)$,
which is itself tending towards equilibrium. Informally, (4.1) captures the main idea
of strongly convergent FP. A formal presentation of the algorithm is given below. 

4.1. Strongly Convergent Variant of Classical FP. Consider a variant of 
FP in which the action for player i at time t is chosen according to the following 
randomized rule: 


              A           B
     A     √2, 1        0, 0
     B      0, 0        1, √2

Fig. 3.1
$a_i(t) \sim g_i'(t) := \begin{cases} b_i(t-1), & \text{if } X_i(t) = 1, \\ q_i(t-1), & \text{otherwise}, \end{cases}$   (4.2)


where $b_i(t-1) \in BR_i(q_{-i}(t-1))$, the notation $a_i(t) \sim g_i'(t)$ indicates that the
action $a_i(t)$ is drawn as a random sample^5 from the probability mass function $g_i'(t)$,
$X_i(t) \in \{0, 1\}$ is a random variable, and $q_i(t)$ is the player’s empirical distribution as
defined in (4.4) below. Let $\mathcal{F}_t := \sigma(\{a(s), X_1(s), \ldots, X_n(s), b_1(s), \ldots, b_n(s)\}_{s \le t})$, and
note that $g_i'(t)$ is $\mathcal{F}_t$-measurable. Let

$\rho_i(t) := \mathbb{P}(X_i(t) = 1 \mid \mathcal{F}_{t-1})$,

and note that $\rho_i(t)$ is $\mathcal{F}_{t-1}$-measurable. Intuitively speaking, $\rho_i(t)$ represents the
probability that player i deliberately chooses to play a best response strategy in 
round t given the history of play up through the previous round. We make the 
following assumptions regarding each player’s probability of deliberately choosing a 
best response: 

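A.1. $\lim_{t \to \infty} \rho_i(t) = 0$, $\forall i \in N$, a.s.,

A.2. $\sum_{t \ge 1} \rho_i(t) = \infty$, $\forall i \in N$, a.s.,

A.3. $\lim_{t \to \infty} \rho_i(t) / \rho_j(t) = 1$, $\forall i, j \in N$, a.s.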

The first assumption ensures that players eventually transition towards playing
their next-stage action as a sample from their empirical distribution rather than playing
a deliberate best response. The second assumption ensures that, for each player,
a deliberate best response is played infinitely often. The third assumption ensures
that the numbers of deliberate best responses taken by the players remain relatively in
sync.^6 In practice, players may choose their deliberate best responses completely asynchronously;
for example, setting $\rho_i(t) = t^{-r}$, with $r \in (0, 1]$, results in (purely)
independent sampling of deliberate best response rounds and secures A.1-A.3.
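
As a quick check that a polynomially decaying best-response probability $t^{-r}$ with $r \in (0, 1]$ is compatible with the first two assumptions, the following is a standard integral-comparison argument (a sketch added for illustration, not taken from the paper):

```latex
\begin{align*}
\lim_{t \to \infty} t^{-r} &= 0 \qquad (r > 0), \\
\sum_{t=1}^{T} t^{-r} \;\ge\; \int_{1}^{T+1} s^{-r}\, ds &=
\begin{cases}
\dfrac{(T+1)^{1-r} - 1}{1 - r}, & 0 < r < 1, \\[6pt]
\ln(T+1), & r = 1,
\end{cases}
\end{align*}
```

Both lower bounds diverge as $T \to \infty$, so the sum in A.2 is infinite, whereas any $r > 1$ would make the series converge and violate A.2. Since every player uses the same sequence, any condition that compares the rates of two players is satisfied trivially.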

Let

$\ell_i(t) := \sum_{k=1}^{t} X_i(k)$   (4.3)

count the number of times player $i$ has deliberately played a best response until and
including round $t$. Note that $\ell_i(t)$ is $\mathcal{F}_t$-measurable. The empirical distribution $q_i(t)$
is defined recursively as^7

$q_i(t+1) = q_i(t) + \frac{X_i(t+1)}{\ell_i(t+1)} \big(a_i(t+1) - q_i(t)\big)$.   (4.4)

Intuitively speaking, the empirical distribution (4.4) is updated only over rounds when
a deliberate best response was played. Note that $q_i(t)$ is $\mathcal{F}_t$-measurable.^8
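
The following Python sketch illustrates the recursion, under the reading that (4.4) is a running average taken only over rounds in which a deliberate best response was played (as the surrounding text describes). The random indicators and actions below are stand-ins generated for the check, and the helper name `update_empirical` is hypothetical.

```python
import random


def update_empirical(q, action, x, ell_next):
    """One step of (4.4): q moves toward `action` only if x == 1.

    q: current empirical distribution, action: unit vector a_i(t+1),
    x: indicator X_i(t+1), ell_next: count l_i(t+1) including this round.
    """
    if x == 0:
        return list(q)
    return [qk + (ak - qk) / ell_next for qk, ak in zip(q, action)]


rng = random.Random(0)                     # fixed seed, arbitrary choice
unit = [[1.0, 0.0], [0.0, 1.0]]
q, ell = list(unit[0]), 1                  # q_i(1) = a_i(1), X_i(1) = 1
deliberate = [unit[0]]                     # actions from rounds with X_i = 1
for t in range(2, 500):
    x = 1 if rng.random() < t ** -0.5 else 0   # stand-in for X_i(t)
    action = unit[rng.randrange(2)]            # stand-in for a_i(t)
    ell += x
    q = update_empirical(q, action, x, ell)
    if x:
        deliberate.append(action)

# Plain average of the actions played in deliberate rounds.
average = [sum(a[k] for a in deliberate) / len(deliberate) for k in range(2)]
print(q, average)
```

Up to floating-point rounding, the recursively updated `q` coincides with the plain average of the actions recorded in deliberate rounds, and rounds with `x == 0` leave it unchanged.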

^5 The action $a_i(t) \in A_i$ is technically a Dirac distribution over the finite action space $Y_i$ (see
Section 2), and the mixed strategy $g_i'(t)$ is a probability distribution over $Y_i$. More precisely, the
notation $a_i(t) \sim g_i'(t)$ means that an action $y_i(t)$ is drawn as a random sample from $g_i'(t)$, with
$a_i(t) := \delta_{y_i(t)}$, where $\delta_{y_i(t)}(y_i) = 1$ if $y_i = y_i(t)$ and $\delta_{y_i(t)}(y_i) = 0$ otherwise.

^6 Note that since $\rho_i(t)$ is only required to be $\mathcal{F}_{t-1}$-measurable, this parameter is in fact adaptively
tunable. This is a feature of practical interest since it allows players to adjust their deliberate best
response rates on the fly, possibly adapting to the (initially unknown) deliberate best response rates
of others and to underlying process dynamics, in order to satisfy A.1-A.3.

^7 To initialize the process, let the action $a_i(1)$ be chosen arbitrarily, let $q_i(1) = a_i(1)$, and let
$X_i(1) = 1$ for all $i$.

^8 Note that (4.2) implicitly assumes that players have knowledge of the empirical distributions

Finally, let

$g_i(t) := b_i(t-1)\,\rho_i(t) + q_i(t-1)\,(1 - \rho_i(t))$,   (4.5)

and note that $g_i(t)$ is $\mathcal{F}_{t-1}$-measurable.^9 More importantly, note that for every
$\alpha_i \in A_i$, $g_i(\alpha_i, t) = \mathbb{P}(a_i(t) = \alpha_i \mid \mathcal{F}_{t-1})$, and thus $g_i(t)$ represents the mixed strategy
(conditioned on past play) used by player $i$ in round $t$. The joint mixed strategy used
in round $t$ is given by $g(t) := (g_1(t), \ldots, g_n(t))$.

We refer to a process where, for each player i, ai(t) is updated according to (4.2), 
qi{t) is updated according to (4.4), and gi{t) is updated according to (4.5) as the 
strongly convergent variant of (classical) FP (for reasons to be clear soon). 
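
The update rules (4.2), (4.4), and (4.5) can be put together in a short simulation. The Python sketch below is a minimal illustration under several assumptions that are not from the paper: a two-player matching pennies game with payoffs +1/-1, a common best-response probability $t^{-r}$ with $r = 0.5$, ties broken toward action 0, and a fixed random seed. Because the empirical distribution only changes in deliberate rounds, the sketch does not materialize the action sampled from $q_i(t-1)$ in the other rounds.

```python
import random

# Assumed matching-pennies payoffs, indexed [own action][opponent action]:
# player 0 wins on a match, player 1 wins on a mismatch.
PAYOFF = [
    [[1.0, -1.0], [-1.0, 1.0]],
    [[-1.0, 1.0], [1.0, -1.0]],
]


def best_response(payoff, opp_mix):
    values = [sum(u * p for u, p in zip(row, opp_mix)) for row in payoff]
    return 0 if values[0] >= values[1] else 1   # ties go to action 0


def strongly_convergent_fp(rounds, r=0.5, seed=0):
    rng = random.Random(seed)
    q = [[1.0, 0.0], [0.0, 1.0]]   # q_i(1) = a_i(1); first actions arbitrary
    ell = [1, 1]                   # l_i(1) = X_i(1) = 1
    trace = []                     # (rho(t), g(t), q(t-1)) for t = 2, 3, ...
    for t in range(2, rounds + 1):
        rho = t ** (-r)
        q_prev = [list(qi) for qi in q]
        g = []
        for i in range(2):
            b = best_response(PAYOFF[i], q_prev[1 - i])      # b_i(t-1)
            e_b = [1.0 if a == b else 0.0 for a in range(2)]
            # (4.5): mixed strategy used in round t
            g.append([rho * e_b[a] + (1.0 - rho) * q_prev[i][a]
                      for a in range(2)])
            if rng.random() < rho:                           # X_i(t) = 1
                # (4.2): play b; (4.4): update q_i with the new count
                ell[i] += 1
                q[i] = [q_prev[i][a] + (e_b[a] - q_prev[i][a]) / ell[i]
                        for a in range(2)]
            # Otherwise a_i(t) is sampled from q_i(t-1) and q_i is unchanged.
        trace.append((rho, g, q_prev))
    return trace


trace = strongly_convergent_fp(5000)
rho, g, q_prev = trace[-1]
print(rho, g)
```

By construction $g_i(t) - q_i(t-1) = \rho_i(t)\,(b_i(t-1) - q_i(t-1))$, so the gap between the strategy actually played and the empirical distribution shrinks at the rate of $\rho_i(t)$; this is the mechanism by which weak convergence of $q(t)$ is transferred to $g(t)$.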

4.2. Strong Convergence in Classical FP: Main Result. The following 
result states that in the strongly convergent variant of FP, players’ period-by-period 
mixed strategies converge to the set of Nash equilibria—i.e., strong learning is achieved. 

Corollary 1. Let $\Gamma$ be a two-player zero-sum game, potential game, or generic
$2 \times m$ game. Assume A.1-A.3 hold. Then the strongly convergent variant of FP
achieves strong learning in the sense that $\lim_{t \to \infty} d(g(t), NE) = 0$ almost surely.

In order to prove the above result, we first study a more general notion of fictitious 
play and then prove the result as a corollary of the general theorem (see Theorem 
5.1). Taking this general approach allows our strong convergence results to be
applied to other FP-type algorithms, e.g., Generalized Weakened FP (Section 6.2)
and Empirical Centroid FP (Section 6.3). The proof of Corollary 1 is given in Section
6.1.


4.3. Simulation Example. In order to demonstrate the learning properties of 
strongly convergent FP, we simulated classical FP and strongly convergent FP in a 
simple two-player matching pennies game with utility functions as shown in Figure 
4.1a. The game has a unique (symmetric) mixed-strategy equilibrium in which both 
players choose either action with probability 1/2. Figure 4.1b shows the period- 
by-period strategies generated by classical FP. Players’ strategies are always pure 
and progress in continuously lengthening cycles. While the time-averaged empirical 
distribution is being driven to equilibrium, the period-by-period strategies clearly are 
not. 

Figure 4.1c shows the period-by-period strategies generated by strongly convergent
FP with p{t) = t~-^^. Players’ period-by-period strategies are converging to the
unique Nash equilibrium of the game. 

Figure 4.1d shows the utility received by the realized joint action $a(t)$ in each
round of repeated play for both learning algorithms. The received payoffs in classical
FP cycle around the value of the game, while the received payoffs in strongly
convergent FP converge to the value of the game. 

One possible tradeoff in strongly convergent FP is that less frequent deliberate 
best response actions and less frequent updating of the empirical distribution (see 


of opponents when computing a best response. This may be accomplished by assuming that players’
actions are accompanied with a “tag” indicating whether or not the played action was a deliberate
best response. Alternatively, the information regarding $q_i(t)$ may be tracked by the individual player $i$
and disseminated by a gossip-type algorithm [12] or implicitly disseminated through a payoff-based
scheme.

^9 To see this, note first that $q_i(t-1)$ and $\rho_i(t)$ have been shown to be $\mathcal{F}_{t-1}$-measurable. Furthermore,
this implies that $BR_i(q_{-i}(t-1))$ is $\mathcal{F}_{t-1}$-measurable. Lastly, by construction, $b_i(t-1) \in
BR_i(q_{-i}(t-1))$ is $\mathcal{F}_{t-1}$-measurable.




Fig. 4.1: 4.1a: Matching pennies payoff matrix, 4.1b: The probability of each player playing
heads in round $t$ using the classical FP algorithm, 4.1c: The probability of each player playing
heads in round $t$ using the strongly convergent FP algorithm, 4.1d: The received utility in
round $t$ given the realized action $a(t)$, 4.1e: The empirical distribution process of the action
H (heads) for player 1 in both FP and strongly convergent FP.


(4.4)) may lead to a slow-down in convergence rate. The empirical distribution processes
for player 1 in each algorithm are shown in Figure 4.1e with p{t) =

5. General Setup. In this section we study strong learning in FP-type algorithms,
a class of algorithms that generalizes FP and includes many learning processes
based on best-response dynamics.^10 In Section 5.1, we define the notion of an
FP-type algorithm. In Section 5.2 we present some examples of an FP-type algorithm. 
In Section 5.3 we define the strongly convergent variant of an FP-type algorithm. In 
Section 5.4 we provide the general strong convergence result for an FP-type algorithm 
(see Theorem 5.1), and in Sections 5.5-5.7 we prove the general result. 

5.1. FP-Type Algorithm. An FP-type algorithm generalizes classical FP in 
the following ways: (i) the notion of a player’s empirical distribution is generalized, 
(ii) players are permitted to use a function of the empirical distribution (rather than 
use the empirical distribution itself) as a predictor of the next-round strategy of opponents,
(iii) convergence to equilibrium may occur in terms of a function of the
empirical distribution (rather than convergence to equilibrium of the empirical distribution
itself), and (iv) limit sets other than the set of NE are permitted.

We define an FP-type algorithm as follows. Let players be engaged in repeated
play of a stage game $\Gamma$. Let $a_i(t)$ represent the action of player $i$ in round
$t \in \{1, 2, \ldots\}$, and let $H_i(t) := \{a_i(s)\}_{s=1}^{t}$ represent the action history of player $i$ up to
and including round $t$.

^10 The class of FP-type algorithms proposed here is similar in spirit to the class of best-response-
based algorithms considered in [9].

In classical FP, for each player $i$, the normalized histogram of the player’s action
choices (3.1) is used as a compact representation of the player’s action history. In
the general formulation of an FP-type algorithm, we still suppose that players track a
compact representation of the action history, but we allow the compact representation
to take on a fairly general form,^11 as stated in the following assumption:

A.4. The empirical distribution of player $i$ is of the form $q_i(t) := f_i^q(H_i(t), t)$,
where $f_i^q(\cdot, t) : A_i^t \to \Delta_i$.

We make the following assumption regarding the
sequence of functions $\{f_i^q(\cdot, t)\}_{t \ge 1}$ used to form the empirical distribution sequence
of player $i$:

A.5. For any history sequence $\{H_i(t)\}_{t \ge 1}$ for player $i$, there holds
$\lim_{t \to \infty} \|f_i^q(H_i(t+1), t+1) - f_i^q(H_i(t), t)\| = 0$.

In particular, this implies that, regardless of the action history, there holds
$\lim_{t \to \infty} \|q_i(t+1) - q_i(t)\| = 0$ for each player $i$. This fairly mild assumption captures
the essential characteristics required for our asymptotic analysis, and may be seen
as a generalization of classical FP where exact averaging of actions over time yields
$\|q_i(t+1) - q_i(t)\| \le \sqrt{2}/(t+1)$ (see Section 5.2.1). Together, assumptions A.4-A.5 allow
us to consider a variety of FP-inspired algorithms, including those with general step
sizes [11] and those with more intricate history-dependent rules such as derivative
action [32].
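
To see why classical FP satisfies A.5, the step between consecutive empirical distributions can be computed directly from the averaging rule (3.1). The derivation below is a short worked sketch; the constant $\sqrt{2}$ is simply the largest Euclidean distance between a vertex of the simplex and any other point of it.

```latex
\begin{align*}
q_i(t+1) &= \frac{1}{t+1} \sum_{s=1}^{t+1} a_i(s)
          = \frac{t}{t+1}\, q_i(t) + \frac{1}{t+1}\, a_i(t+1), \\
q_i(t+1) - q_i(t) &= \frac{1}{t+1} \bigl( a_i(t+1) - q_i(t) \bigr), \\
\| q_i(t+1) - q_i(t) \| &\le \frac{\sqrt{2}}{t+1} \to 0 \quad \text{as } t \to \infty .
\end{align*}
```

The last line uses $\|e_j - q\|^2 = (1 - q_j)^2 + \sum_{k \ne j} q_k^2 \le 2(1 - q_j)^2 \le 2$ for any unit vector $e_j$ and any $q$ in the simplex.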

In an FP-type algorithm, players form a prediction of the future behavior of
opponents as a function of the current empirical distribution. Let $p_i(t)$ be player $i$’s
prediction of opponent strategies for the upcoming round $(t+1)$. We assume:

A.6. Player $i$’s prediction $p_i(t)$ of opponent behavior is of the form
$p_i(t) = f_i^p(q(t))$, where $f_i^p : \Delta^n \to \Delta_{-i}$ is a Lipschitz continuous, time-invariant function.

We say a sequence of actions $\{a(t)\}_{t \ge 1}$ is an FP-type process if for all $i \in N$ and
all $t \ge 1$, $a_i(t+1) \in BR_i^{\epsilon_t}(p_i(t))$, where $BR_i^{\epsilon_t}$ is the $\epsilon_t$-best response set (recall
(2.1)), and $\{\epsilon_t\}_{t \ge 1}$ is a sequence satisfying $\lim_{t \to \infty} \epsilon_t = 0$.

In many variants of FP, including classical FP, learning occurs in the sense that
$d(q(t), NE) \to 0$. We generalize this notion of learning by allowing for limit sets
other than the set of NE and allowing for convergence in terms of a function of $q(t)$
rather than permitting convergence only in terms of $q(t)$ itself.

Let $E$ be some target equilibrium set (not necessarily the set of NE). An FP-type
process is said to learn elements of $E$ if for each $i$ there exists a function $f_i^\xi$
satisfying:

A.7. The function $f_i^\xi : \Delta^n \to \Delta_i$ is Lipschitz continuous and time
invariant, and such that, for $\xi_i(t) := f_i^\xi(q(t))$ and
$\xi(t) := (\xi_1(t), \ldots, \xi_n(t))$, there holds $\lim_{t \to \infty} d(\xi(t), E) = 0$.
We refer to $\xi(t)$ as the asymptotic learning distribution, and $f_i^\xi$ as the
convergence map of player $i$.

In general, we will denote an instance of an FP-type learning algorithm by
$\Psi = (\{f_i^q(\cdot, t)\}_{t \geq 1}, f_i^p, f_i^\xi)_{i \in N}$. In order to construct a
strongly convergent variant of $\Psi$ we will require that $\Psi$ obtain weak convergence
in a sufficiently robust sense, as stated in the following assumption.

A.8. For the stage game $\Gamma$ and equilibrium set $E$, the FP-type algorithm $\Psi$ is
such that for any sequence $(\epsilon_t)_{t \geq 1}$ satisfying
$\lim_{t \to \infty} \epsilon_t = 0$, the FP-type algorithm $\Psi$ obtains weak convergence
in the sense that $\lim_{t \to \infty} d(\xi(t), E) = 0$.

Footnote: In most literature, the notion of an empirical distribution refers strictly to
the time-averaged empirical histogram of a player's action choices, as in classical FP (3.1).
However, as discussed in Section 2, we use the term empirical distribution more generally to
refer to an arbitrarily formed (see A.4) distribution that a player uses to track information
regarding opponents' empirical action histories. This abuse of terminology allows us to more
naturally extend concepts to the general FP-type setting.

The above assumption ensures that the FP-type algorithm is robust to asymptotically
decaying perturbations in a player's best response set. When studying the strongly
convergent variant of $\Psi$ in the following section, the assumption A.8 will serve
to ensure that convergence of the process is not disrupted by minor asynchronies in
the number of deliberate best responses taken by each player (i.e., minor disparities
in (4.3)).

5.2. Examples. 

5.2.1. Classical Fictitious Play. Classical FP (Section 3.1) fits the template
of an FP-type algorithm with $q_i(t) = \frac{1}{t} \sum_{s=1}^{t} a_i(s)$. Note that $q_i(t)$
may be written in recursive form as
$q_i(t+1) = q_i(t) + \frac{1}{t+1} (a_i(t+1) - q_i(t))$. Thus,
$\|q_i(t+1) - q_i(t)\| \leq \frac{M_i}{t+1}$, where
$M_i := \sup_{p', p'' \in \Delta_i} \|p' - p''\|$, and A.5 is satisfied. The prediction
map $f_i^p$ is given by the identity function, and the convergence map $f_i^\xi$ is also
given by the identity function. The target equilibrium set is given by $E := NE$, the set
of Nash equilibria.
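
The recursive form and the resulting step bound can be checked numerically. The sketch below is illustrative only: it assumes actions are encoded as vertices of the simplex, uses the Euclidean norm (for which the simplex has diameter $\sqrt{2}$), and draws a hypothetical random action sequence rather than best responses.

```python
import numpy as np

def empirical_average(actions):
    """Batch form: row t-1 holds q(t) = (1/t) * sum_{s<=t} a(s)."""
    actions = np.asarray(actions, dtype=float)
    t = np.arange(1, len(actions) + 1)[:, None]
    return np.cumsum(actions, axis=0) / t

def recursive_average(actions):
    """Recursive form: q(t+1) = q(t) + (a(t+1) - q(t)) / (t+1)."""
    actions = np.asarray(actions, dtype=float)
    q = actions[0].copy()
    out = [q.copy()]
    for t in range(1, len(actions)):   # t rounds have been averaged so far
        q = q + (actions[t] - q) / (t + 1)
        out.append(q.copy())
    return np.array(out)

rng = np.random.default_rng(0)
m = 3                                  # hypothetical number of pure actions
acts = np.eye(m)[rng.integers(0, m, size=200)]   # actions as simplex vertices

q_batch = empirical_average(acts)
q_rec = recursive_average(acts)

M_i = np.sqrt(2.0)                     # Euclidean diameter of the simplex
steps = np.linalg.norm(np.diff(q_rec, axis=0), axis=1)  # ||q(t+1) - q(t)||
bounds = M_i / np.arange(2, len(acts) + 1)              # M_i / (t+1)
```

Here `steps[k]` is the increment from round `k+1` to round `k+2`, so it is compared against `M_i / (k+2)`.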

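5.2.2. Generalized Weakened Fictitious Play. Leslie et al. [11] study a useful
generalization of FP, termed Generalized Weakened FP (GWFP), in which players
are permitted to choose a suboptimal best response each round, so long as the degree
of suboptimality decays asymptotically to zero, and in which step-size sequences other
than $\{1/t\}_{t \geq 1}$ are permitted.

Formally, for $p_{-i} \in \Delta_{-i}$ and $\epsilon \geq 0$, let
$\widetilde{BR}_i^{\epsilon}(p_{-i}) := \{p_i \in \Delta_i : U_i(p_i, p_{-i}) \geq \max_{\alpha_i \in A_i} U_i(\alpha_i, p_{-i}) - \epsilon\}$,
and for $p \in \Delta^n$, let
$\widetilde{BR}^{\epsilon}(p) := (\widetilde{BR}_1^{\epsilon}(p_{-1}), \ldots, \widetilde{BR}_n^{\epsilon}(p_{-n}))$.
A sequence $\{q(t)\}_{t \geq 1}$ is said to be a GWFP process if

$$q(t+1) \in (1 - \gamma(t+1))\, q(t) + \gamma(t+1) \left( \widetilde{BR}^{\epsilon_t}(q(t)) + M_{t+1} \right)$$

with $\gamma(t) \to 0$ and $\epsilon_t \to 0$ as $t \to \infty$,
$\sum_{t \geq 1} \gamma(t) = \infty$, and $\{M_t\}_{t \geq 1}$ a deterministic (or stochastic)
perturbation sequence satisfying, for every $T > 0$,

$$\lim_{t \to \infty} \sup_{k} \left\{ \left\| \sum_{i=t}^{k-1} \gamma(i+1) M_{i+1} \right\| : \sum_{i=t}^{k-1} \gamma(i+1) \leq T \right\} = 0 \quad \text{(a.s.)}.$$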

We consider a special case of GWFP in which $M_t = 0$, $\forall t$, and the $\epsilon$-best
response set is restricted to the set of pure strategy $\epsilon$-best responses. That is, we
consider the subset of GWFP processes such that
$a_i(t+1) \in BR_i^{\epsilon_t}(q_{-i}(t))$ for each $i$, and,

$$q(t+1) = q(t) + \gamma(t+1) \left( a(t+1) - q(t) \right), \qquad (5.1)$$

with $\epsilon_t \to 0$, and in a slight variation of terminology we refer to the sequence of
actions $\{a(t)\}_{t \geq 1}$ satisfying the above as a GWFP process.

In the terminology of Section 5.1, GWFP fits the template of an FP-type algorithm
with the empirical distribution $q_i(t)$ defined recursively as in (5.1) (where it is
assumed that $\lim_{t \to \infty} \gamma(t) = 0$), the prediction map $f_i^p$ given by the
identity function for all $i$, the convergence map $f_i^\xi$ given by the identity function
for all $i$, and the target equilibrium set given by $E := NE$, the set of Nash equilibria.
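
A minimal sketch of the special case (5.1) for a two-player game follows. It is not the construction of [11]; the payoff-matrix layout, the uniform initialization, the random tie-breaking among $\epsilon$-best responses, and the particular sequences $\gamma(t) = 1/t$ and $\epsilon_t = 1/t$ are all assumptions made for illustration.

```python
import numpy as np

def eps_best_responses(payoffs, eps):
    """Indices of pure actions whose expected payoff is within eps of the maximum."""
    return np.flatnonzero(payoffs >= payoffs.max() - eps)

def gwfp_two_player(U1, U2, steps, gamma, eps, seed=0):
    """Pure-action GWFP with M_t = 0. U1[a1, a2] and U2[a1, a2] are the payoffs
    to players 1 and 2 (hypothetical layout). Returns (q1, q2) after `steps` updates."""
    rng = np.random.default_rng(seed)
    m1, m2 = U1.shape
    q1 = np.full(m1, 1.0 / m1)        # assumed initialization: uniform
    q2 = np.full(m2, 1.0 / m2)
    for t in range(1, steps + 1):
        # Each player picks any pure eps_t-best response to the opponent's q(t).
        a1 = rng.choice(eps_best_responses(U1 @ q2, eps(t)))
        a2 = rng.choice(eps_best_responses(U2.T @ q1, eps(t)))
        g = gamma(t + 1)
        # Update (5.1): q(t+1) = q(t) + gamma(t+1) * (a(t+1) - q(t)).
        q1 = q1 + g * (np.eye(m1)[a1] - q1)
        q2 = q2 + g * (np.eye(m2)[a2] - q2)
    return q1, q2

# Matching pennies (two-player zero-sum); the unique NE is uniform play.
U1 = np.array([[1.0, -1.0], [-1.0, 1.0]])
q1, q2 = gwfp_two_player(U1, -U1, steps=5000,
                         gamma=lambda t: 1.0 / t, eps=lambda t: 1.0 / t)
```

With $\gamma(t) = 1/t$ and $\epsilon_t = 0$ this reduces to classical FP; other step sizes only need to vanish while summing to infinity.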

Footnote: The set $\widetilde{BR}_i^{\epsilon}(p_{-i})$ defined in Section 5.2.2 differs from
the set $BR_i^{\epsilon}(p_{-i})$ defined in the preliminaries in that
$\widetilde{BR}_i^{\epsilon}(p_{-i})$ includes all mixed strategy best responses, whereas
$BR_i^{\epsilon}(p_{-i})$ contains only the pure strategy best responses. The set
$\widetilde{BR}_i^{\epsilon}(p_{-i})$ is used here in order to precisely define a GWFP process
as given in [11], but the remainder of the paper focuses on the set $BR_i^{\epsilon}(p_{-i})$.

5.2.3. Empirical Centroid Fictitious Play: Learning Consensus Equilibria.
Empirical Centroid FP (ECFP) was conceived as a variant of FP suited to
implementation in large-scale games [12,13]. In ECFP, rather than tracking the
empirical distribution of each individual opponent (as in FP), players track and respond
to only the centroid of the empirical distributions. In order to ensure the process is
well defined the following assumption is made:

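A.9. All players use the same strategy space.

Under this assumption, let the empirical distribution be defined by

$$q_i(t) := \frac{1}{t} \sum_{s=1}^{t} a_i(s), \qquad (5.2)$$

and let the empirical centroid distribution be defined by
$\bar{q}(t) := \frac{1}{n} \sum_{i=1}^{n} q_i(t)$. We say a sequence of actions
$\{a(t)\}_{t \geq 1}$ is an ECFP process if for all $i$ and all $t \geq 1$,

$$a_i(t+1) \in BR_i^{\epsilon_t}(\bar{q}_{-i}(t)), \qquad (5.3)$$

where $\bar{q}_{-i}(t) = (\bar{q}(t), \ldots, \bar{q}(t)) \in \Delta_{-i}$ is the
$(n-1)$-tuple containing $(n-1)$ repeated copies of $\bar{q}(t)$, and
$\{\epsilon_t\}_{t \geq 1}$ is a sequence satisfying $\lim_{t \to \infty} \epsilon_t = 0$.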

In ECFP, players learn elements of the set of consensus Nash equilibria, defined
by $C := \{p = (p_1, \ldots, p_n) \in NE : p_1 = p_2 = \cdots = p_n\}$, the subset of Nash
equilibria in which all players use identical strategies (see [12] for more details). Define
$\bar{q}^n(t) := (\bar{q}(t), \ldots, \bar{q}(t)) \in \Delta^n$ to be the $n$-tuple containing
repeated copies of $\bar{q}(t)$; learning in ECFP takes place in the sense that
$\lim_{t \to \infty} d(\bar{q}^n(t), C) = 0$.

In the terminology of Section 5.1, ECFP fits the template of an FP-type algorithm
with the empirical distribution given by (5.2), the prediction map $f_i^p$ given by
$f_i^p(q(t)) := \bar{q}_{-i}(t)$, the $(n-1)$-tuple containing repeated copies of
$\bar{q}(t)$, and the convergence map given by
$f_i^\xi(q(t)) := \frac{1}{n} \sum_{j=1}^{n} q_j(t)$. The target equilibrium set is given by
$E := C$, the set of consensus Nash equilibria.
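
To make the centroid-based best response concrete, the following sketch performs one ECFP round for a small game. It assumes a single utility function `u` shared by all players and invariant to permutations of the action profile (a simplification; the function name, the brute-force enumeration over opponent profiles, and the tie-breaking rule are illustrative choices, not part of [12,13]).

```python
import itertools
import numpy as np

def centroid(q):
    """q has shape (n, m), row i holding q_i(t); returns (1/n) * sum_i q_i(t)."""
    return q.mean(axis=0)

def utility_vs_centroid(u, a_i, q_bar, n):
    """Expected utility of pure action a_i when each of the n-1 opponents
    independently plays the mixed strategy q_bar (brute-force enumeration)."""
    m = len(q_bar)
    total = 0.0
    for others in itertools.product(range(m), repeat=n - 1):
        prob = np.prod([q_bar[y] for y in others])
        total += prob * u((a_i,) + others)
    return total

def ecfp_round(u, q, t, eps=0.0):
    """One round: every player eps-best-responds to the centroid, then the
    time averages (5.2) are updated from round t to round t+1."""
    n, m = q.shape
    q_bar = centroid(q)
    payoffs = np.array([utility_vs_centroid(u, a, q_bar, n) for a in range(m)])
    # Identical utilities and a common centroid give every player the same payoffs;
    # ties are broken by taking the lowest-index eps-best response (an assumption).
    best = int(np.flatnonzero(payoffs >= payoffs.max() - eps)[0])
    actions = np.tile(np.eye(m)[best], (n, 1))
    return q + (actions - q) / (t + 1), best, payoffs

# Hypothetical 3-player coordination game: payoff 1 if all actions agree.
u = lambda y: 1.0 if len(set(y)) == 1 else 0.0
q = np.array([[1.0, 0.0], [1.0, 0.0], [0.1, 0.9]])
q_next, best, payoffs = ecfp_round(u, q, t=1)
```

The enumeration is exponential in the number of players and is only meant to show what quantity is being maximized; large-scale implementations would exploit structure in the utility.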

5.2.4. Empirical Centroid Fictitious Play: Learning Mean-Centric Equilibria.
In this section we consider a slight modification of the ECFP algorithm presented
in Section 5.2.3 that enables players to learn elements of an alternate (non-Nash)
equilibrium set.

Let an ECFP action process be defined as in (5.3). Define the set of mean-centric
equilibria by
$MCE := \{p \in \Delta^n : U_i(p_i, \bar{p}_{-i}) \geq U_i(p_i', \bar{p}_{-i}) \ \forall p_i' \in \Delta_i, \ \forall i\}$,
where $\bar{p}_{-i}$ is formed from the centroid of $p$ in the same way that
$\bar{q}_{-i}(t)$ is formed from $q(t)$ in Section 5.2.3. The set of MCE is neither a
superset nor a subset of the NE; rather, it is a set of natural equilibrium points
tailored to the ECFP dynamics [33]. The set of consensus Nash equilibria $C$ (see
Section 5.2.3), however, is contained in the set of MCE.

In ECFP, players learn elements of MCE in the sense that
$\lim_{t \to \infty} d(q(t), MCE) = 0$. In the terminology of Section 5.1, this fits the
template of an FP-type algorithm with $q_i(t)$ given by (5.2), $f_i^p$ defined in the same
way as in Section 5.2.3, the convergence map $f_i^\xi$ given by the identity for all $i$,
and the target equilibrium set given by $E := MCE$.

Note that the only difference between the ECFP algorithm discussed in Section 5.2.3
and the ECFP algorithm discussed here is the choice of target equilibrium set $E$ and
convergence maps $f_i^\xi$.

Footnote: We assume here that the set of consensus Nash equilibria is non-empty. When
revisiting ECFP in Section 6.3, we provide an assumption on the utility structure that
ensures that the set is indeed non-empty.

5.3. Strongly Convergent Variant of an FP-type Algorithm. In this section
we construct the strongly convergent variant of an FP-type learning algorithm.
The construction here is a generalization of that of Section 4.1 where we constructed
the strongly convergent variant of classical FP.

Let $\Psi = (\{f_i^q(\cdot, t)\}_{t \geq 1}, f_i^p, f_i^\xi)_{i \in N}$ be an FP-type learning
algorithm. For each $i \in N$, let $\{X_i(t)\}_{t \geq 1}$ be a sequence of random variables
with $X_i(t) \in \{0, 1\}$. Analogous to Section 4, $X_i(t) = 1$ will serve to indicate that
player $i$ took a deliberate best response in round $t$. Let

$$\ell_i(t) := \sum_{s=1}^{t} X_i(s) \qquad (5.4)$$

count the number of deliberate best responses taken by player $i$ through round $t$.

In Section 4.1 the empirical distribution of player $i$, (4.4), is a time average taken
only over rounds when player $i$ took a deliberate best response. In order to generalize
this notion to an FP-type algorithm, define the term

$$\tau_i(s) := \inf\{t : \ell_i(t) = s\}. \qquad (5.5)$$

For $s \geq 1$, $\tau_i(s)$ indicates the round when player $i$ took their $s$-th deliberate
best response, and the sequence $\{\tau_i(s)\}_{s \geq 1}$ gives the subsequence of rounds
when player $i$ took a deliberate best response. For $t \in \{1, 2, \ldots\}$ let
$H_i(t) := \{a_i(\tau_i(s)) : \tau_i(s) \leq t\}$ denote the action history of player $i$.
Note that $H_i(t)$ records only the history of actions that were taken as deliberate best
responses. Let the empirical distribution of player $i$ at time $t$ be formed as

$$q_i(t) := f_i^q(H_i(t), \ell_i(t)). \qquad (5.6)$$

Let the asymptotic learning distribution (see A.7 and subsequent discussion) be given
by $\xi_i(t) := f_i^\xi(q(t))$ and $\xi(t) := (\xi_1(t), \ldots, \xi_n(t))$.

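Let the action for player $i$ in round $t \geq 2$ be chosen according to the random rule

$$a_i(t) := \begin{cases} b_i(t-1), & \text{if } X_i(t) = 1, \\ \text{an action sampled from } \xi_i(t-1), & \text{otherwise,} \end{cases} \qquad (5.7)$$

where $p_i(t-1) = f_i^p(q(t-1))$ and $b_i(t-1) \in BR_i^{\eta_t}(p_i(t-1))$, and assume

A.10. The sequence $\{\eta_t\}_{t \geq 1}$ associated with $b_i(t)$ of (5.7) is such that
$\lim_{t \to \infty} \eta_t = 0$.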

t—¥OC 

Let $\mathcal{F}_t := \sigma(\{a(s), X_1(s), \ldots, X_n(s), b_1(s), \ldots, b_n(s)\}_{s \leq t})$.
Let the probability that player $i$ chooses a deliberate best response in round $t$
conditioned on past events be given by
$\rho_i(t) := P(X_i(t) = 1 \mid \mathcal{F}_{t-1})$, and assume A.1-A.3 hold. Note that
$q_i(t)$, $p_i(t)$, $\xi_i(t)$, and $\ell_i(t)$ are $\mathcal{F}_t$-measurable and that by
definition, $\rho_i(t)$ is $\mathcal{F}_{t-1}$-measurable. Finally, let

$$g_i(t) := b_i(t-1)\, \rho_i(t) + \xi_i(t-1)\, (1 - \rho_i(t)). \qquad (5.8)$$

Note that $g_i(t)$ is $\mathcal{F}_{t-1}$-measurable and that
$g_i(a_i, t) = P(a_i(t) = a_i \mid \mathcal{F}_{t-1})$; that is, $g_i(t)$ represents the
mixed strategy in use by player $i$ in round $t$ (compare with (4.5)).


Footnote: Note that by (5.10), $\tau_i(s)$ is finite valued a.s. for any $s \in \{1, 2, \ldots\}$.

initialize the process, let the action ai(l) be chosen arbitrarily, let Xi{l) = 1, and let 
^(1) = ai(l) for all i. 

Footnote: Note that this assumption subsumes the more typical assumption that
$\eta_t = 0$, $\forall t$. By making this more general assumption we are able to handle
interesting scenarios that may arise in a practical implementation of the algorithm; e.g.,
players have some asymptotically decaying error in their knowledge of their utility function
or knowledge of opponents' empirical distributions.

Let $g(t) := (g_1(t), \ldots, g_n(t))$ denote the joint mixed strategy in use at time $t$.

We refer to a process where, for each player $i$, $q_i(t)$ is updated according to (5.6),
$a_i(t)$ is updated according to (5.7), and $g_i(t)$ is updated according to (5.8) as the
strongly convergent variant of $\Psi$ (for reasons that will become clear shortly; see
Theorem 5.1). In Section 6 we will demonstrate applications of this in the context of the
previous examples.
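
The construction can be sketched in code for the simplest instance, classical FP in a two-player game, where $f_i^p$ and $f_i^\xi$ are identities and $\eta_t = 0$. The code below is a hedged illustration: the probability sequence $\rho(t) = t^{-1/2}$ is a hypothetical choice that vanishes while summing to infinity, and whether it meets A.1-A.3 exactly as stated earlier in the paper is not verified here. The payoff layout and the round-1 initialization are likewise assumptions.

```python
import numpy as np

def strong_variant_fp(U1, U2, rounds, rho, seed=0):
    """Strongly convergent variant of classical FP (two players, eta_t = 0).
    U1[a1, a2], U2[a1, a2] are payoffs to players 1 and 2 (hypothetical layout).
    Returns g(T), q(T-1) and the deliberate-response counts l_i(T)."""
    rng = np.random.default_rng(seed)
    U = [np.asarray(U1, dtype=float), np.asarray(U2, dtype=float).T]
    m = [U[0].shape[0], U[1].shape[0]]
    # Round 1: arbitrary action (index 0), counted as a deliberate best response.
    counts = [np.eye(m[i])[0] for i in range(2)]
    ell = [1, 1]
    g, q_prev = None, None
    for t in range(2, rounds + 1):
        # q_i(t-1): average over deliberate-best-response rounds only, as in (5.6).
        q_prev = [counts[i] / ell[i] for i in range(2)]
        r = rho(t)
        g, updates = [], []
        for i in range(2):
            b = int(np.argmax(U[i] @ q_prev[1 - i]))        # b_i(t-1)
            # Mixed strategy (5.8), with xi_i(t-1) = q_i(t-1) for classical FP.
            g.append(r * np.eye(m[i])[b] + (1.0 - r) * q_prev[i])
            deliberate = rng.random() < r                   # X_i(t)
            # Rule (5.7): best response if deliberate, else a sample from xi_i(t-1).
            action = b if deliberate else int(rng.choice(m[i], p=q_prev[i]))
            updates.append((action, deliberate))
        for i, (action, deliberate) in enumerate(updates):
            if deliberate:                                  # only these enter H_i(t)
                counts[i] = counts[i] + np.eye(m[i])[action]
                ell[i] += 1
    return g, q_prev, ell

U1 = np.array([[1.0, -1.0], [-1.0, 1.0]])   # matching pennies
T = 500
g, q_prev, ell = strong_variant_fp(U1, -U1, rounds=T, rho=lambda t: t ** -0.5)
```

Sampled actions from non-deliberate rounds are played but never recorded in the history, which is what keeps the embedded process an ordinary FP-type process.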

5.4. General Result. The following theorem provides the general result from
which the strong convergence of various FP-type algorithms can be derived.

Theorem 5.1. Let $\Gamma$ be a finite normal form game, let $E$ be an equilibrium set,
and let $\Psi$ be an FP-type algorithm satisfying A.4-A.8. If the strongly convergent
variant of $\Psi$ satisfies A.1-A.3 and A.10 then it achieves strong learning in the
sense that $\lim_{t \to \infty} d(g(t), E) = 0$, almost surely.

We emphasize that in the above result players' period-by-period mixed strategies
$g(t)$ are converging to equilibrium. In general, when seeking to construct the
strongly convergent variant of some FP-type algorithm $\Psi$, the most challenging aspect
of applying Theorem 5.1 is the verification that $\Psi$ satisfies A.8. The remaining
assumptions A.4-A.7 are generally fairly trivial to verify. Assumptions A.1-A.3
and A.10 pertain to the manner in which the strongly convergent variant of $\Psi$ is
constructed and are not related to intrinsic properties of $\Psi$ itself.

5.5. Some Additional Definitions. In order to prove Theorem 5.1 we will
study the behavior of an underlying FP-type process that is embedded in the action,
history, and empirical distribution processes produced by the strongly convergent
variant of $\Psi$. In particular, for $i \in N$ and $s \in \{1, 2, \ldots\}$, let $\tau_i(s)$
be defined as in (5.5), and define the following terms:
$\tilde{a}_i(s) := a_i(\tau_i(s))$, $\tilde{a}(s) := (\tilde{a}_1(s), \ldots, \tilde{a}_n(s))$,
$\tilde{H}_i(s) := H_i(\tau_i(s))$, $\tilde{q}_i(s) := q_i(\tau_i(s))$,
$\tilde{q}(s) := (\tilde{q}_1(s), \ldots, \tilde{q}_n(s))$,
$\tilde{p}_i(s) := f_i^p(\tilde{q}(s))$,
$\tilde{\xi}(s) := (f_1^\xi(\tilde{q}(s)), \ldots, f_n^\xi(\tilde{q}(s)))$.
The aforementioned terms (marked with a tilde) correspond to the embedded FP-type
process that we will study in the proof of Theorem 5.1. In particular, for each player
$i$, the sequence $\{\tau_i(s)\}_{s \geq 1}$ denotes the subsequence of rounds when the
player chose to play a deliberate best response. The sequence
$\{\tilde{a}_i(s)\}_{s \geq 1}$ is the action sequence occurring along the subsequence of
rounds when player $i$ chose to play a deliberate best response. The sequence
$\{\tilde{H}_i(s)\}_{s \geq 1}$ corresponds to the action history of player $i$ along the
same subsequence. The sequence $\{\tilde{q}_i(s)\}_{s \geq 1}$ corresponds to the empirical
distribution of player $i$ along the same subsequence; in particular, note that by
Lemma 7.5 (see appendix), $\{\tilde{q}_i(s)\}_{s \geq 1}$ fits the format prescribed by A.4
for the embedded FP-type process: $\tilde{q}_i(s) = f_i^q(\tilde{H}_i(s), s)$. Finally,
the term $\tilde{\xi}(s)$ is the asymptotic learning distribution associated with the
embedded FP-type process.

In studying the embedded FP-type process, it will be important to characterize
the terms to which players are best responding. With this in mind, note that per
(5.7), the action at time $\tau_i(s+1)$ (in the strongly convergent variant of $\Psi$) is
chosen as $a_i(\tau_i(s+1)) \in BR_i^{\eta_{\tau_i(s+1)}}(p_i(\tau_i(s+1) - 1))$. In order to
translate this to the embedded FP-type process, define the following terms:
$\hat{q}_j^i(s) := q_j(\tau_i(s+1) - 1)$,
$\hat{q}^i(s) := (q_1(\tau_i(s+1) - 1), \ldots, q_n(\tau_i(s+1) - 1))$,
$\hat{p}_i(s) := f_i^p(\hat{q}^i(s))$. By construction, the $(s+1)$-th
action of player $i$ in the embedded FP-type process is chosen as,

$$\tilde{a}_i(s+1) \in BR_i^{\eta_{\tau_i(s+1)}}(\hat{p}_i(s)). \qquad (5.9)$$

In the embedded FP-type process, the term $\tilde{q}_j(s)$ may be thought of as the 'true'
empirical distribution of player $j$. The term $\hat{q}_j^i(s)$ may be thought of as the
estimate which player $i$ maintains of $\tilde{q}_j(s)$, and the term $\hat{q}^i(s)$ (note
the superscript) may be thought of as player $i$'s estimate of the joint empirical
distribution $\tilde{q}(s)$ at the time of player $i$'s $(s+1)$-th best response. Finally,
the term $\hat{p}_i(s)$ may be thought of as player $i$'s prediction of opponents'
next-stage strategy given $\hat{q}^i(s)$; in particular, note that, in the embedded
FP-type process, player $i$ chooses their stage $(s+1)$ action (5.9) as an asymptotic best
response to $\hat{p}_i(s)$.

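5.6. Some Useful Properties. Let

$$\Omega' := \left\{ \omega : \lim_{t \to \infty} \frac{\ell_i(t)}{\sum_{k=1}^{t} \rho_i(k)} = 1, \ \forall i \right\}.$$

By Lemma 7.6 (see appendix), there holds $P(\Omega') = 1$. In proving Theorem 5.1 we will
restrict attention to (sample path) realizations in $\Omega'$.

Note that under assumption A.2, there holds
$\{\omega : \lim_{t \to \infty} \ell_i(t) = \infty, \ \forall i\} \supseteq \Omega'$.
By the equivalence
$\{\omega : \lim_{t \to \infty} \ell_i(t) = \infty, \ \forall i\} = \{\omega : X_i(t) = 1 \text{ infinitely often } \forall i\}$,
there holds $\{\omega : X_i(t) = 1 \text{ infinitely often } \forall i\} \supseteq \Omega'$.
Therefore, by the definitions of $\ell_i$ and $\tau_i$, there holds for any realization in
$\Omega'$, $\lim_{t \to \infty} \ell_i(t) = \infty$, and

$$\tau_i(s) < \infty, \ \forall s \in \mathbb{N}, \qquad (5.10)$$

$$\lim_{s \to \infty} \tau_i(s) = \infty. \qquad (5.11)$$

These properties will be useful in the proof of Theorem 5.1. In particular, the proof
will frequently make reference to $\tilde{q}_i(s)$ or $\tilde{a}_i(s)$ for arbitrary
$s \in \mathbb{N}$; the property (5.10) ensures that such terms are well defined for any
$\omega \in \Omega'$.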

Note also that for any realization in $\Omega'$, for $i \in N$ and $s \in \{1, 2, \ldots\}$,

$$\ell_i(\tau_i(s)) = s, \qquad (5.12)$$

and for $i \in N$ and $t \in \{1, 2, \ldots\}$,

$$X_i(t) = 1 \implies \tau_i(\ell_i(t)) = t. \qquad (5.13)$$

Furthermore, note that $X_i(t) = 0$ implies that $\ell_i(t) = \ell_i(t-1)$ and
$H_i(t) = H_i(t-1)$, and in particular,

$$X_i(t) = 0 \implies q_i(t) = q_i(t-1). \qquad (5.14)$$

These facts are readily verified by conferring with the definitions of $\tau_i$, $\ell_i$, and $X_i$.
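
The bookkeeping identities relating the counter (5.4) and the hitting times (5.5) can be checked on any finite indicator sequence. The sketch below uses a hypothetical sequence for a single player; the function name and data layout are illustrative.

```python
import numpy as np

def counts_and_times(X):
    """X[t-1] in {0, 1} plays the role of X_i(t), t = 1..T.
    Returns ell with ell[t-1] = l_i(t), and tau mapping s -> tau_i(s),
    the first round t at which l_i(t) = s."""
    X = np.asarray(X, dtype=int)
    ell = np.cumsum(X)
    tau = {}
    for t, value in enumerate(ell, start=1):
        s = int(value)
        if s >= 1 and s not in tau:
            tau[s] = t
    return ell, tau

X = [1, 0, 0, 1, 1, 0, 1, 0]      # hypothetical indicators, with X(1) = 1
ell, tau = counts_and_times(X)

# l(tau(s)) = s for every s reached within the horizon.
identity_count = all(ell[tau[s] - 1] == s for s in tau)
# X(t) = 1 implies tau(l(t)) = t.
identity_time = all(tau[int(ell[t - 1])] == t
                    for t in range(1, len(X) + 1) if X[t - 1] == 1)
```

On rounds with `X[t-1] == 0` the counter does not move, which is why the history and hence the empirical distribution stay frozen on those rounds.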


5.7. Proof of Theorem 5.1. Proof. Since $P(\Omega') = 1$ it is sufficient to show
that the desired result holds for any $\omega \in \Omega'$. Henceforth, we restrict
attention to realizations $\omega \in \Omega'$, and for ease of notation suppress the term
$\omega$ when referring to random variables.

As a first step, we wish to show that $\lim_{s \to \infty} d(\tilde{\xi}(s), E) = 0$. We
accomplish this by showing that there exists a sequence $\{\epsilon_s\}_{s \geq 1}$ such that
$\lim_{s \to \infty} \epsilon_s = 0$ and
$\tilde{a}_i(s+1) \in BR_i^{\epsilon_s}(\tilde{p}_i(s))$. By assumption A.8, it will then
follow that $\lim_{s \to \infty} d(\tilde{\xi}(s), E) = 0$. To that end, note that by
Lemma 7.1 (see appendix),
$\lim_{s \to \infty} |U_i(a_i(\tau_i(s+1)), p_i(\tau_i(s+1) - 1)) - v_i(p_i(\tau_i(s+1) - 1))| = 0$,
$\forall i$, or equivalently by the definitions of $\tilde{a}_i(s)$ and $\hat{p}_i(s)$
(see Section 5.5),

$$\lim_{s \to \infty} |U_i(\tilde{a}_i(s+1), \hat{p}_i(s)) - v_i(\hat{p}_i(s))| = 0, \ \forall i. \qquad (5.15)$$

By Lemma 7.3 (see appendix), $\lim_{s \to \infty} \|\hat{q}^i(s) - \tilde{q}(s)\| = 0$. By
A.6, it follows that $\lim_{s \to \infty} \|\hat{p}_i(s) - \tilde{p}_i(s)\| = 0$, which by
the Lipschitz continuity of $U_i(\cdot)$ implies that
$\lim_{s \to \infty} |U_i(\alpha_i, \hat{p}_i(s)) - U_i(\alpha_i, \tilde{p}_i(s))| = 0$,
$\forall \alpha_i \in A_i$, $\forall i$, and
$\lim_{s \to \infty} |v_i(\hat{p}_i(s)) - v_i(\tilde{p}_i(s))| = 0$, $\forall i$. Returning
to (5.15) we see that
$\lim_{s \to \infty} |U_i(\tilde{a}_i(s+1), \tilde{p}_i(s)) - v_i(\tilde{p}_i(s))| = 0$,
$\forall i$, i.e., there exists a sequence $\{\epsilon_s\}_{s \geq 1}$ such that
$\epsilon_s \to 0$ and $\tilde{a}_i(s+1) \in BR_i^{\epsilon_s}(\tilde{p}_i(s))$.
It follows by A.8 that

$$\lim_{s \to \infty} d(\tilde{\xi}(s), E) = 0. \qquad (5.16)$$
We now proceed to show that $\lim_{t \to \infty} d(\xi(t), E) = 0$. Let $\epsilon > 0$ be
given. By Lemma 7.2 (see appendix) and assumption A.7, for each $i \in N$, there exists a
random time $S_i' > 0$ such that $\forall s \geq S_i'$,
$\|\xi(\tau_i(s)) - \tilde{\xi}(s)\| < \frac{\epsilon}{2}$. Let $S' = \max_i \{S_i'\}$.
By (5.16) there exists a random time $S''$ such that $\forall s \geq S''$,
$d(\tilde{\xi}(s), E) < \frac{\epsilon}{2}$. Let $S = \max\{S', S''\}$. Then

$$d(\xi(\tau_i(s)), E) < \epsilon, \ \forall i, \ \forall s \geq S. \qquad (5.17)$$

Let $T = \max_i \{\tau_i(S)\}$. Note that for some $i$, $\xi(T) = \xi(\tau_i(S))$, and
therefore by (5.17),

$$d(\xi(T), E) < \epsilon. \qquad (5.18)$$

Also note that for any $t_0 > T$, it holds that $\ell_i(t_0) \geq S$ (since
$\ell_i(\tau_i(S)) = S$, and $\ell_i(t)$ is non-decreasing in $t$), and moreover

$$\begin{aligned}
X_i(t_0) = 1 \text{ for some } i &\implies q(t_0) = q(\tau_i(\ell_i(t_0))) \implies \xi(t_0) = \xi(\tau_i(\ell_i(t_0))), \\
X_i(t_0) = 0 \text{ for all } i &\implies q(t_0) = q(t_0 - 1) \implies \xi(t_0) = \xi(t_0 - 1),
\end{aligned} \qquad (5.19)$$

where the first implication holds with $\ell_i(t_0) \geq S$. In the above, the first
line follows from (5.13), and the second line follows from (5.14). Consider $t > T$. If
for some $i$, $X_i(t) = 1$, then by (5.19) and (5.17),
$d(\xi(t), E) = d(\xi(\tau_i(\ell_i(t))), E) < \epsilon$.
Otherwise, if $X_i(t) = 0$ $\forall i$, then $\xi(t) = \xi(t-1)$.

Iterate this argument $m$ times until either (i) $X_i(t-m) = 1$ for some $i$, or (ii)
$t - m = T$. In the case of (i),
$d(\xi(t), E) = d(\xi(t-m), E) = d(\xi(\tau_i(\ell_i(t-m))), E) < \epsilon$,
where the inequality again follows from (5.17) and the fact that
$t - m > T \implies \ell_i(t-m) \geq S$. In the case of (ii),
$d(\xi(t), E) = d(\xi(T), E) < \epsilon$, where the inequality follows from (5.18). Since
$\epsilon > 0$ was chosen arbitrarily, it follows that
$\lim_{t \to \infty} d(\xi(t), E) = 0$.

Finally, we show that $\lim_{t \to \infty} d(g(t), E) = 0$. Note that by (5.8),
$\|g_i(t) - \xi_i(t-1)\| \leq M_i \rho_i(t)$, $\forall i$, where
$M_i := \max_{p', p'' \in \Delta_i} \|p' - p''\|$ is a constant. Invoking assumption A.1
gives $\lim_{t \to \infty} \|g_i(t) - \xi_i(t-1)\| = 0$, $\forall i$. Combining this with
the fact that $\lim_{t \to \infty} d(\xi(t), E) = 0$ yields the desired result,
$\lim_{t \to \infty} d(g(t), E) = 0$. □
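
For completeness, the bound used in the final step can be written out in one line. The derivation below takes (5.8) to read $g_i(t) = \rho_i(t)\, b_i(t-1) + (1 - \rho_i(t))\, \xi_i(t-1)$, which is the form consistent with the inequality invoked in the proof, and uses only that $b_i(t-1)$ and $\xi_i(t-1)$ both lie in $\Delta_i$.

```latex
\begin{aligned}
\|g_i(t) - \xi_i(t-1)\|
  &= \bigl\| \rho_i(t)\, b_i(t-1) + (1 - \rho_i(t))\, \xi_i(t-1) - \xi_i(t-1) \bigr\| \\
  &= \rho_i(t)\, \| b_i(t-1) - \xi_i(t-1) \| \\
  &\le M_i\, \rho_i(t), \\[4pt]
d(g(t), E)
  &\le \| g(t) - \xi(t-1) \| + d(\xi(t-1), E) .
\end{aligned}
```

Both terms on the right of the last line vanish as $t \to \infty$, the first by A.1 and the second by the preceding step of the proof.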

6. Applications of the General Result. In this section we consider three
different FP-type algorithms and study the strongly convergent variant of each. In
each case, we prove strong convergence by showing that the FP-type algorithm fits the
template of Theorem 5.1. Generally, the only non-trivial aspect of applying Theorem
5.1 will be to show that A.8 is satisfied.

In Section 6.1 we consider classical FP. The fact that classical FP satisfies A.8
was shown by Leslie et al. [11]. In Section 6.2 we consider GWFP, a generalization
of FP proposed in [11]. Again, the crucial step of showing that GWFP satisfies A.8
was shown in [11]. In Section 6.3 we consider a variant of FP termed ECFP. That
ECFP satisfies A.8 was shown in [34]. We emphasize that each of these algorithms
is known to achieve weak learning in the sense that $d(\xi(t), E) \to 0$ as
$t \to \infty$. Our contribution is to construct a variant where players also achieve
learning in the strong sense that period-by-period mixed strategies also converge to
equilibrium.

6.1. Strong Convergence in Classical FP. We now prove Corollary 1 using
the general convergence result of Theorem 5.1.

Proof. Classical FP fits the template of an FP-type algorithm with the empirical
distribution given by $q_i(t) = \frac{1}{t} \sum_{s=1}^{t} a_i(s)$, the functions $f_i^p$
and $f_i^\xi$ given by the identity function for each $i$, and the best response
perturbation given by $\epsilon_t = 0$, $\forall t$. To show that the strongly convergent
variant of classical FP attains strong learning, it suffices to show that the assumptions
of Theorem 5.1 are met.

To that end, note that A.1-A.3 are satisfied by assumption, and A.10 is trivially
satisfied (with $\eta_t = 0$, $\forall t$). Furthermore, the empirical distribution
sequence satisfies $\lim_{t \to \infty} \|q_i(t) - q_i(t-1)\| = 0$ (see Section 5.2.1), and
hence A.5 is satisfied. The functions $f_i^p$ and $f_i^\xi$ (each being the identity
function) satisfy A.6-A.7. Therefore, it is sufficient to show that A.8 is satisfied.
But, for zero-sum games, potential games, and generic $2 \times m$ games this holds by
[11], Corollary 5. □

6.2. Strong Convergence in Generalized Weakened FP. GWFP was introduced
in Section 5.2.2, where it was shown to fit the template of an FP-type algorithm.

Since, by definition, a GWFP process allows players to choose an $\epsilon_t$-sub-optimal
best response with $\epsilon_t \to 0$, the following result ([11], Corollary 5) guarantees
that a GWFP process satisfies A.8 in the noted classes of games.

Theorem 6.1. Any generalized weakened fictitious play process will converge to
the set of Nash equilibria in two-player zero-sum games, potential games, and generic
$2 \times m$ games.

To clarify the precise meaning of the convergence stated above as it relates to the
present work, we emphasize that Theorem 6.1 implies that
$\lim_{t \to \infty} d(q(t), NE) = 0$; i.e., the process converges weakly to equilibrium.

Let the strongly convergent variant of GWFP be constructed using the approach
laid out in Section 5.3. The following corollary to Theorem 5.1 states that the strongly
convergent variant of a GWFP process will achieve strong learning.

Corollary 2. Let $\Gamma$ be a two-player zero-sum game, potential game, or generic
$2 \times m$ game. Let $\Psi$ be an instance of GWFP. If the strongly convergent variant
of $\Psi$ satisfies A.1-A.3 and A.10, then it achieves strong learning in the sense that
$\lim_{t \to \infty} d(g(t), NE) = 0$.

Proof. It is sufficient to show that the conditions of Theorem 5.1 are met. Note
that A.1-A.3 and A.10 hold by assumption. Furthermore, by definition, any GWFP
process satisfies $\lim_{t \to \infty} \gamma(t) = 0$, and hence satisfies A.5. The functions
$f_i^p$ and $f_i^\xi$ are given by the identity function for each $i$, and hence A.6 and
A.7 hold. Thus, it suffices to show that A.8 holds for the specified class of games; but
this follows from Theorem 6.1. □

Footnote: It should be noted that classical FP may be seen as an instance of GWFP, and
thus Corollary 1 may in fact be deduced as a corollary to Corollary 2. However, for clarity
and continuity of presentation, the results regarding classical FP have been presented
separately.

6.3. Strong Convergence in Empirical Centroid FP. ECFP was introduced
in Sections 5.2.3 and 5.2.4. In order to study the asymptotic behavior of ECFP (in
either of the formats introduced in Sections 5.2.3 and 5.2.4) we make the following
assumption regarding the structure of players' utility functions:

A.11. The players' utility functions are identical and permutation invariant.
That is, for any $i, j \in N$, $u_i(y) = u_j(y)$, and
$u([y']_i, [y'']_j, y_{-(i,j)}) = u([y'']_i, [y']_j, y_{-(i,j)})$,
where, for any player $k \in N$, the notation $[y']_k$ indicates the action $y'$ (an
element of player $k$'s action space) being played by player $k$, and $y_{-(i,j)}$ denotes
the set of actions being played by all players other than $i$ and $j$.

We note that, under this assumption, the sets $C$ and $MCE$ are nonempty [12, 33].
The following theorem ([34], Theorem 1) specifies the manner in which players
engaged in an ECFP process (weakly) learn elements of the sets $C$ and $MCE$.

Theorem 6.2. Let $\{a(t)\}_{t \geq 1}$ be an ECFP process. Assume $\Gamma$ is such that
A.9 and A.11 hold. Then players learn equilibrium strategies in the sense that
(i) $\lim_{t \to \infty} d(\bar{q}^n(t), C) = 0$, and
(ii) $\lim_{t \to \infty} d(q(t), MCE) = 0$.

Note that case (i) above corresponds to ECFP with the convergence map $f_i^\xi$ as
given in Section 5.2.3, and case (ii) corresponds to the convergence map $f_i^\xi$ given by
the identity function (as in Section 5.2.4). Since, by definition, an ECFP process
(5.3) allows players to choose actions from the $\epsilon_t$-sub-optimal best response set
with $\epsilon_t \to 0$, Theorem 6.2 ensures that ECFP satisfies A.8.

Let $\Psi$ be an instance of ECFP as presented in either Section 5.2.3 or Section
5.2.4, and let the strongly convergent variant of $\Psi$ be constructed using the approach
laid out in Section 5.3. The following corollary to Theorem 5.1 states that players
engaged in the strongly convergent variant of an ECFP process learn elements of $C$
and $MCE$ in the strong sense that players' period-by-period strategies converge to
equilibrium.

Corollary 3. (i) Let $\Psi$ be an instance of ECFP with
$f_i^\xi(q) = \frac{1}{n} \sum_{j=1}^{n} q_j$ for all $i$, and assume $\Gamma$ is such that
A.9 and A.11 hold. If the strongly convergent variant of $\Psi$ satisfies A.1-A.3 and
A.10, then it achieves strong learning in the sense that
$\lim_{t \to \infty} d(g(t), C) = 0$.

(ii) Let $\Psi$ be an instance of ECFP with $f_i^\xi(q)$ given by the identity function for
all $i$ and assume $\Gamma$ is such that A.9 and A.11 hold. If the strongly convergent
variant of $\Psi$ satisfies A.1-A.3 and A.10, then it achieves strong learning in the sense
that $\lim_{t \to \infty} d(g(t), MCE) = 0$.

Proof. Cases (i) and (ii) differ only in terms of the function $f_i^\xi$ and target
equilibrium set $E$. However, in both cases the function $f_i^\xi$ satisfies A.7. It
suffices to show the remaining conditions of Theorem 5.1 are satisfied. Henceforth we treat
cases (i) and (ii) equivalently.

Note that A.1-A.3 and A.10 hold by assumption. The empirical distribution
sequence satisfies $\|q_i(t) - q_i(t-1)\| \leq \frac{M_i}{t} \to 0$ as $t \to \infty$, where
$M_i := \sup_{p', p'' \in \Delta_i} \|p' - p''\|$, and hence A.5 is satisfied. Note that the
function $f_i^p(q) = \bar{q}_{-i}$ satisfies A.6. Finally, Theorem 6.2 shows that A.8 is
satisfied. □

7. Conclusions. An algorithm is said to achieve weak learning if players learn
an equilibrium strategy in an abstract sense (see Section 2), but period-by-period
strategies do not necessarily converge to equilibrium. An algorithm is said to achieve
strong learning if (additionally) players' period-by-period strategies converge to
equilibrium. Weak learning may be thought of as a form of learning where players learn a
strategy in some abstract sense, but never begin to implement the strategy they are
learning. On the other hand, in strong learning, not only do players learn a strategy,
but they also physically implement the learned strategy through the course of the
learning process.

Fictitious Play (FP) and its variants are known to exhibit weak learning but
not necessarily strong learning. An approach was presented for taking a general
FP-type algorithm that achieves weak learning, and constructing from it a strongly
convergent variant of the algorithm. General convergence results were proved and
used to construct a strongly convergent variant of several example FP-type processes.

In order to apply the convergence results proved in this paper, it is necessary
to ensure a candidate algorithm meets A.8 (the other necessary assumptions are
relatively trivial to verify). An interesting future research direction might be to
investigate other FP-type algorithms (e.g., [32,35]) and verify whether they meet the
assumptions sufficient for construction of a strongly convergent variant.

Appendix. 

7.1. Some Useful Inequalities. We consider some useful inequalities related to the
strongly convergent variant of an FP-type algorithm. We restrict attention to realizations
$\omega \in \Omega'$. Let $\{q_i(t)\}_{t \geq 1}$ be given by (5.6). By A.5 there exists a
sequence $\gamma(t)$ such that $\lim_{t \to \infty} \gamma(t) = 0$, and for each $i \in N$,

$$\|q_i(t+1) - q_i(t)\| \leq M_i\, \gamma(\ell_i(t)), \qquad (7.1)$$

where $M_i := \sup_{q', q'' \in \Delta_i} \|q' - q''\|$. Similarly, there holds for any
integer $s > 0$,

$$\|\tilde{q}(s+1) - \tilde{q}(s)\| \leq M\, \gamma(s), \qquad (7.2)$$

where $M := \sup_{q', q'' \in \Delta^n} \|q' - q''\|$. More generally, for any integers
$s_1, s_2 > 0$, if A.5 holds then,

$$\|\tilde{q}(s_1) - \tilde{q}(s_2)\| \leq M \sum_{s = \min\{s_1, s_2\}}^{\max\{s_1, s_2\} - 1} \gamma(s) \leq |s_1 - s_2|\, B, \qquad (7.3)$$

where $0 < B < \infty$ is such that $\sup_t \gamma(t) \leq B/M$.

7.2. Intermediate Results. 


Lemma 7.1. Let $\tau_i(s)$ be defined as in Section 5.5, and assume A.10 holds. Then for
any realization in $\Omega'$ there holds
$\lim_{s \to \infty} |U_i(a_i(\tau_i(s)), p_i(\tau_i(s) - 1)) - v_i(p_i(\tau_i(s) - 1))| = 0$,
$\forall i$.

Proof. Let $s \in \mathbb{N}$. Note that by definition
$\tau_i(s) := \inf\{t : \ell_i(t) = s\}$ and $\ell_i(t) := \sum_{k=1}^{t} X_i(k)$, thus
$X_i(\tau_i(s)) = 1$. By (5.7) this implies
$a_i(\tau_i(s)) = b_i(\tau_i(s) - 1) \in BR_i^{\eta_{\tau_i(s)}}(p_i(\tau_i(s) - 1))$,
which implies
$|U_i(a_i(\tau_i(s)), p_i(\tau_i(s) - 1)) - v_i(p_i(\tau_i(s) - 1))| \leq \eta_{\tau_i(s)}$.
By A.10, $\eta_t \to 0$ as $t \to \infty$, and moreover, by (5.11), $\tau_i(s) \to \infty$
as $s \to \infty$. Thus $\eta_{\tau_i(s)} \to 0$ as $s \to \infty$, and the claim holds. □

Lemma 7.2. Let $i, j \in N$, let $\tau_i(s)$ and $\tilde{q}_j(s)$ be defined as in
Section 5.5, and assume A.2-A.3 hold. Then for any realization in $\Omega'$,
$\lim_{s \to \infty} \|q_j(\tau_i(s)) - \tilde{q}_j(s)\| = 0$.

Proof. Note that for any $t \in \mathbb{N}$,
$q_j(t) = q_j(\tau_j(\ell_j(t))) = \tilde{q}_j(\ell_j(t))$, where the first equality
follows from Lemma 7.4, and the second equality follows from the definition of
$\tilde{q}_j(s)$. Hence,

$$\|q_j(\tau_i(s)) - \tilde{q}_j(s)\| = \|\tilde{q}_j(\ell_j(\tau_i(s))) - \tilde{q}_j(s)\| = \|\tilde{q}_j(\ell_j(\tau_i(s))) - \tilde{q}_j(\ell_i(\tau_i(s)))\| \leq |\ell_j(\tau_i(s)) - \ell_i(\tau_i(s))|\, B,$$

where the first equality follows from the previous statement, the second equality
follows from the fact that $\ell_i(\tau_i(s)) = s$ (see (5.12)), and the final inequality
follows from (7.3). Thus, it suffices to show that

$$\lim_{s \to \infty} |\ell_j(\tau_i(s)) - \ell_i(\tau_i(s))| = 0. \qquad (7.4)$$

For convenience in notation let $h_i(t) := \sum_{k=1}^{t} \rho_i(k)$. By Lemma 7.6 and the
definition of $\Omega'$ there holds for any $k \in N$,
$\lim_{t \to \infty} \frac{\ell_k(t)}{h_k(t)} = 1$. By assumption A.3, for any $k \in N$,
$\lim_{t \to \infty} \frac{h_k(t)}{h_i(t)} = 1$. Hence, for any $k \in N$,

$$\lim_{t \to \infty} \frac{\ell_k(t)}{h_i(t)} = \lim_{t \to \infty} \frac{\ell_k(t)}{h_k(t)} \frac{h_k(t)}{h_i(t)} = 1. \qquad (7.5)$$


Returning attention to (7.4) and recalling that by (5.11),
$\lim_{s \to \infty} \tau_i(s) = \infty$ on $\Omega'$, we have,

$$\begin{aligned}
\limsup_{s \to \infty} |\ell_j(\tau_i(s)) - \ell_i(\tau_i(s))|
  &\leq \limsup_{t \to \infty} |\ell_j(t) - \ell_i(t)| \\
  &= \limsup_{t \to \infty} \left| \frac{\ell_j(t)}{h_i(t)} h_i(t) - \frac{\ell_i(t)}{h_i(t)} h_i(t) \right| \\
  &= \limsup_{t \to \infty} |h_i(t) - h_i(t)| = 0,
\end{aligned}$$

where the transition to the last line follows from application of (7.5). Thus, (7.4) is
verified, and the desired result holds. □

Lemma 7.3. Let $i, j \in N$, let $\hat{q}_j^i(s)$ and $\tilde{q}_j(s)$ be defined as in
Section 5.5, and assume A.2-A.3 hold. Then for any realization in $\Omega'$ there holds
$\lim_{s \to \infty} \|\hat{q}_j^i(s) - \tilde{q}_j(s)\| = 0$.

Proof. Recall that by definition, $\hat{q}_j^i(s) = q_j(\tau_i(s+1) - 1)$; our objective
then is to show that
$\lim_{s \to \infty} \|q_j(\tau_i(s+1) - 1) - \tilde{q}_j(s)\| = 0$. By Lemma 7.2,
$\lim_{s \to \infty} \|q_j(\tau_i(s)) - \tilde{q}_j(s)\| = 0$. By (7.2) and A.5 there holds
$\lim_{s \to \infty} \|\tilde{q}_j(s+1) - \tilde{q}_j(s)\| = 0$. Combining this with the
previous statement,

$$\lim_{s \to \infty} \|q_j(\tau_i(s+1)) - \tilde{q}_j(s)\| = 0. \qquad (7.6)$$

Recalling (7.1), there holds

$$\limsup_{s \to \infty} \|q_j(\tau_i(s+1) - 1) - q_j(\tau_i(s+1))\| \leq \limsup_{s \to \infty} M_j\, \gamma(\ell_j(\tau_i(s+1))) = 0, \qquad (7.7)$$

where the equality holds since $\lim_{s \to \infty} \ell_j(\tau_i(s)) = \infty$ on
$\Omega'$, and by A.5, $\lim_{s \to \infty} \gamma(s) = 0$. Consider now the quantity of
interest,

$$\|q_j(\tau_i(s+1) - 1) - \tilde{q}_j(s)\| \leq \|q_j(\tau_i(s+1) - 1) - q_j(\tau_i(s+1))\| + \|q_j(\tau_i(s+1)) - \tilde{q}_j(s)\|.$$

The first term on the right hand side (RHS) goes to zero by (7.7) and the second term on
the RHS goes to zero by (7.6). Thus,
$\lim_{s \to \infty} \|q_j(\tau_i(s+1) - 1) - \tilde{q}_j(s)\| = 0$, and the claim
holds. □

Lemma 7.4. Let $i \in N$, let $q_i(\cdot)$ be as defined in (5.6), let $\ell_i(\cdot)$ be as defined in (5.4),
and let $\tau_i(\cdot)$ be as defined in (5.5). Then for every realization in $\Omega'$ and any $t \in \{1, 2, \ldots\}$
there holds $q_i(\tau_i(\ell_i(t))) = q_i(t)$.

Proof. Let $t_0 := \tau_i(\ell_i(t)) = \inf\{t' : \ell_i(t') = \ell_i(t)\}$, where the second equality follows from
the definition of $\tau_i(\cdot)$. Note that $t_0 \le t$ and by definition of $t_0$, there holds $\tau_i(\ell_i(t_0)) = t_0$,
and hence $q_i(\tau_i(\ell_i(t_0))) = q_i(t_0)$. Furthermore, by the definition of $t_0$, for $t_0 \le t' \le t$,
there holds $\ell_i(t) = \ell_i(t') = \ell_i(t_0)$, and hence $\tau_i(\ell_i(t)) = \tau_i(\ell_i(t_0))$. Moreover, the fact that
$\ell_i(t) = \ell_i(t') = \ell_i(t_0)$ implies by definition of $\ell_i(\cdot)$ that $X_i(t') = 0$ for $t_0 < t' \le t$ (if
such a $t'$ exists). Thus, by (5.14) there holds $q_i(t) = q_i(t') = q_i(t_0)$ for $t_0 < t' \le t$, and
in particular $q_i(t) = q_i(t_0)$. Combining this with the facts that $q_i(\tau_i(\ell_i(t_0))) = q_i(t_0)$ and
$\tau_i(\ell_i(t)) = \tau_i(\ell_i(t_0))$ yields the desired result. □
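
The bookkeeping behind Lemma 7.4 can be checked numerically on a toy trajectory. The following is a minimal sketch and not the paper's implementation: the helper names `counting_process`, `first_hitting_time`, and `frozen_state` are hypothetical, and the update rule passed to `frozen_state` is an arbitrary stand-in for (5.14). Only the property used in the proof is assumed, namely that the state does not change at times where the indicator is zero.

```python
from bisect import bisect_left
from itertools import accumulate


def counting_process(x):
    """Return ell with ell[t] = x(1) + ... + x(t); ell[0] = 0 pads index 0."""
    return [0] + list(accumulate(x))


def first_hitting_time(ell, s):
    """Return tau(s) = min{t >= 1 : ell[t] == s} for s >= 1.

    ell is nondecreasing with unit jumps, so a binary search suffices.
    """
    t = bisect_left(ell, s, 1)
    if t == len(ell) or ell[t] != s:
        raise ValueError("level s is not reached within the horizon")
    return t


def frozen_state(x, update, q0=0):
    """Return q with q[t] = update(q[t-1], t) if x(t) == 1, else q[t-1].

    The update rule is an illustrative assumption; only the freezing
    behaviour for x(t) == 0 mirrors the property used in the proof.
    """
    q = [q0]
    for t, xt in enumerate(x, start=1):
        q.append(update(q[-1], t) if xt == 1 else q[-1])
    return q


x = [0, 1, 0, 0, 1, 1, 0]            # toy realization of X_i(1), ..., X_i(7)
ell = counting_process(x)            # [0, 0, 1, 1, 1, 2, 3, 3]
q = frozen_state(x, lambda prev, t: prev + t)

for t in range(1, len(x) + 1):
    if ell[t] == 0:
        continue                     # tau(0) is not defined in this sketch
    t0 = first_hitting_time(ell, ell[t])
    assert t0 <= t and ell[t0] == ell[t]
    assert q[t0] == q[t]             # the identity of Lemma 7.4
```

Times before the first nonzero indicator are skipped because the hitting time of level 0 is left undefined here; this is a choice of the sketch rather than something stated in the lemma.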

Lemma 7.5. Let $\Psi = (\{f_i(\cdot, t)\}_{t\ge 1}, \ldots)_{i\in N}$ be an FP-type algorithm, and let the
strongly convergent variant of $\Psi$ be constructed as in Section 5.3. Let $\hat{a}_i(s)$, $\hat{H}_i(s)$, and $\hat{q}_i(s)$ be
as defined in Section 5.5. Then for every realization in $\Omega'$, and for $s \ge 1$, $\hat{q}_i(s) = f_i(\hat{H}_i(s), s)$.

Proof. For $s \ge 1$, note that $\hat{q}_i(s) = q_i(\tau_i(s)) = f_i(H_i(\tau_i(s)), \ell_i(\tau_i(s))) = f_i(\hat{H}_i(s), s)$,
where the first equality follows from the definition of $\hat{q}_i(s)$ in Section 5.5, the second follows
from A.4, and the third follows from the definition of $\hat{H}_i(s)$ in Section 5.5 and (5.12). □
Lemma 7.6. Let $\{X(t)\}_{t\ge 1}$ be 0-1 Bernoulli random variables, let $\ell(t) := \sum_{s=1}^{t} X(s)$
be the associated counting process, let $\mathcal{G}_t := \sigma(\{X(k)\}_{k=1}^{t})$, and let $p(t) := P(X(t) = 1 \mid \mathcal{G}_{t-1})$.
Assume $\sum_{t\ge 1} p(t) = \infty$. Then there holds $\lim_{t\to\infty} \ell(t) / \sum_{s=1}^{t} p(s) = 1$ almost surely.

Proof. The result follows via Lévy's extension of the Borel-Cantelli Lemmas, [30] p. 124. □
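
The ratio limit in Lemma 7.6 can be illustrated by simulation. The sketch below is an assumption-laden illustration and not part of the paper: the function name `count_to_compensator_ratio` is hypothetical, the draws are independent with deterministic probabilities $p(t) = t^{-1/2}$ (a special case of the conditional probabilities in the lemma, chosen so that $\sum_t p(t) = \infty$), and a single finite-horizon sample path only suggests, rather than proves, convergence.

```python
import random


def count_to_compensator_ratio(horizon, prob, seed=0):
    """Simulate X(1), ..., X(horizon) with P(X(t) = 1) = prob(t).

    Returns ell(horizon) / (prob(1) + ... + prob(horizon)).
    """
    rng = random.Random(seed)
    count = 0          # ell(t): number of ones observed so far
    expected = 0.0     # p(1) + ... + p(t)
    for t in range(1, horizon + 1):
        p = prob(t)
        expected += p
        if rng.random() < p:
            count += 1
    return count / expected


# Divergent but vanishing probabilities: p(t) = t^(-1/2).
ratio = count_to_compensator_ratio(100_000, lambda t: t ** -0.5)
print(f"ell(T) / sum p(t) = {ratio:.3f}")
```

Increasing the horizon should bring the printed ratio closer to 1, whereas a summable choice such as $p(t) = t^{-2}$ falls outside the lemma's assumption and need not exhibit this behaviour.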


REFERENCES 

[1] G. W. Brown. Iterative solutions of games by fictitious play. In Activity Analysis of Production and Allocation. Wiley, New York, 1951.

[2] J. Robinson. An iterative method of solving a game. Ann. Math., 54(2):296-301, 1951.

[3] U. Berger. Fictitious play in 2xn games. Journal of Economic Theory, 120(2):139-154, 2005.

[4] P. Milgrom and J. Roberts. Rationalizability, learning, and equilibrium in games with strategic complementarities. Econometrica, 58(6):1255-1277, 1990.

[5] U. Berger. Learning in games with strategic complementarities revisited. Journal of Economic Theory, 143(1):292-301, 2008.

[6] A. Sela and D. Herreiner. Fictitious play in coordination games. International Journal of Game Theory, 28(2):189-197, 1999.

[7] D. Monderer and L. Shapley. Potential games. Games and Econ. Behav., 14(1):124-143, 1996.

[8] M. Benaïm, J. Hofbauer, and S. Sorin. Stochastic approximations and differential inclusions. SIAM J. Control and Optim., 44(1):328-348, 2005.

[9] J. S. Jordan. Three problems in learning mixed-strategy Nash equilibria. Games and Econ. Behav., 5(3):368-386, 1993.

[10] H. P. Young. Strategic learning and its limits, volume 2002. Oxford University Press, 2004.

[11] D. S. Leslie and E. J. Collins. Generalised weakened fictitious play. Games and Econ. Behav., 56(2):285-298, 2006.

[12] B. Swenson, S. Kar, and J. Xavier. Empirical centroid fictitious play: an approach for distributed learning in multi-agent games. Accepted for publication in IEEE Transactions on Signal Processing, http://arxiv.org/abs/1304.4577, 2012.

[13] B. Swenson, S. Kar, and J. Xavier. Distributed learning in large-scale multi-agent games: A modified fictitious play approach. In 46th Asilomar Conference on Signals, Systems, and Computers, pages 1490-1495, Pacific Grove, CA, USA, 2012.

[14] D. Fudenberg and D. K. Levine. The Theory of Learning in Games, volume 2. MIT Press, 1998.

[15] J. R. Marden, G. Arslan, and J. S. Shamma. Joint strategy fictitious play with inertia for potential games. IEEE Trans. Automat. Contr., 54(2):208-220, 2009.

[16] J. R. Marden, H. P. Young, G. Arslan, and J. S. Shamma. Payoff based dynamics for multiplayer weakly acyclic games. SIAM J. Control and Optim., 48(1):373-396, 2009.

[17] G. C. Chasparis, A. Arapostathis, and J. S. Shamma. Aspiration learning in coordination games. SIAM J. Control and Optim., 51(1):465-490, 2013.

[18] B. Pradelski and H. P. Young. Learning efficient Nash equilibria in distributed systems. Games and Econ. Behav., 75(2):882-897, 2012.

[19] J. R. Marden and J. S. Shamma. Revisiting log-linear learning: Asynchrony, completeness and payoff-based implementation. Games and Econ. Behav., 75(2):788-808, 2012.

[20] S. Rass and B. Rainer. Numerical computation of multi-goal security strategies. In Decision and Game Theory for Security, pages 118-133. Springer, 2014.

[21] M. Voorneveld. Pareto-optimal security strategies as minimax strategies of a standard matrix game. J. Optimiz. Theory App., 102(1):203-210, 1999.

[22] T. Alpcan and T. Basar. Network Security: A Decision and Game-Theoretic Approach. Cambridge University Press, 2010.

[23] K. Dabcevic, A. Betancourt, L. Marcenaro, and C. S. Regazzoni. A fictitious play-based game-theoretical approach to alleviating jamming attacks for cognitive radios. In Acoust. Speech, Signal Proc. (ICASSP), 2014 IEEE Int. Conf. on, pages 8158-8162. IEEE, 2014.

[24] O. Candogan, A. Ozdaglar, and P. A. Parrilo. Dynamics in near-potential games. Games and Econ. Behav., 82:66-90, 2013.

[25] D. P. Foster and H. P. Young. Regret testing: A simple payoff-based procedure for learning Nash equilibrium. University of Pennsylvania and Johns Hopkins University (mimeo), 2003.

[26] F. Germano and G. Lugosi. Global Nash convergence of Foster and Young's regret testing. Games and Econ. Behav., 60(1):135-154, 2007.

[27] D. Fudenberg. Learning mixed equilibria. Games and Econ. Behav., 5(3):320-367, 1993.

[28] J. Hofbauer and W. H. Sandholm. On the global convergence of stochastic fictitious play. Econometrica, 70(6):2265-2294, 2002.

[29] B. Swenson, S. Kar, and J. Xavier. Strong convergence to mixed equilibria in fictitious play. In Information Sciences and Systems, 48th Annual Conference on, pages 1-6. IEEE, 2014.

[30] D. Williams. Probability with Martingales. Cambridge University Press, 1991.

[31] D. Monderer and L. S. Shapley. Fictitious play property for games with identical interests. Journal of Economic Theory, 68(1):258-265, 1996.

[32] G. Arslan and J. S. Shamma. Distributed convergence to Nash equilibria with local utility measurements. In Proc. of the 43rd IEEE Conf. on Decision and Control, volume 2, pages 1538-1543, 2004.

[33] B. Swenson, S. Kar, and J. Xavier. Mean-centric equilibrium: An equilibrium concept for learning in large-scale games. In IEEE Glob. Conf. Signal Inf. Process., pages 571-574, 2013.

[34] B. Swenson, S. Kar, and J. Xavier. On robustness properties in empirical centroid fictitious play. Submitted for conference publication, http://arxiv.org/abs/1504.00391, 2015.

[35] J. S. Shamma and G. Arslan. Dynamic fictitious play, dynamic gradient play, and distributed convergence to Nash equilibria. IEEE Trans. Automat. Contr., 50(3):312-327, 2005.


